Thread: Questions about indexes with text_pattern_ops

Questions about indexes with text_pattern_ops

From

"Kaare Rasmussen"

Date:

25 February 2008, 09:15:09

Hi 

The database is initialized with utf8, so in order for LIKE to use the index 
on a text field, I used text_pattern_ops when I created it. So far so good. 

It's in the documentation, but there's no explanation of why this index will 
only work for LIKE searches. How come that I have to have two different 
indexes if I want to give Postgres the ability to choose index scan over seq 
scan on LIKE and non-LIKE searches? 

Is it a performance issue? 

Also, when I tried to create the index as a partial one (avoiding the 95% 
entries with empty strings), Postgresql chooses to use seq scan. This sounds 
counter intuitive to me. 

CREATE INDEX new_index ON a (b text_pattern_ops) WHERE b <> '';
This is 8.2.6.

Re: Questions about indexes with text_pattern_ops

From

Gregory Stark

Date:

25 February 2008, 11:00:47

"Kaare Rasmussen" <kaare@jasonic.dk> writes:

> Hi 
>
> The database is initialized with utf8, so in order for LIKE to use the index on
> a text field, I used text_pattern_ops when I created it. So far so good. 
>
> It's in the documentation, but there's no explanation of why this index will
> only work for LIKE searches. How come that I have to have two different indexes
> if I want to give Postgres the ability to choose index scan over seq scan on
> LIKE and non-LIKE searches? 

Because in non-C locales (which you're almost certainly using if you're using
UTF8) the ordering which the normal text operations use can be quite complex.
Just as an example most locales have spaces being entirely insignificant. So
no range can reliably match a prefix LIKE pattern. The text_pattern_ops use
simple character-by-character ordering which are useful for LIKE but not for
regular < and > comparisons. They're just two different orderings.

> Also, when I tried to create the index as a partial one (avoiding the 95%
> entries with empty strings), Postgresql chooses to use seq scan. This sounds
> counter intuitive to me. 
>
> CREATE INDEX new_index ON a (b text_pattern_ops) WHERE b <> '';
> This is 8.2.6.

Hm, for a simple = or <> I think it doesn't matter which operator class you
use. For < or > it would produce different answers. Postgres isn't clever enough
to notice that this is equivalent though so I think you would have to do
something like (untested):

CREATE INDEX new_index ON a (b text_pattern_ops) WHERE b ~<>~ '';

That uses the same operator that the LIKE clause will use for the index range.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's On-Demand Production
Tuning

Re: Questions about indexes with text_pattern_ops

From

Tom Lane

Date:

25 February 2008, 11:51:24

Gregory Stark <stark@enterprisedb.com> writes:
> Hm, for a simple = or <> I think it doesn't matter which operator class you
> use. For < or > it would produce different answers. Postgres isn't clever enough
> to notice that this is equivalent though so I think you would have to do
> something like (untested):

> CREATE INDEX new_index ON a (b text_pattern_ops) WHERE b ~<>~ '';

> That uses the same operator that the LIKE clause will use for the index range.

I'm intending to get rid of ~=~ and ~<>~ for 8.4; there's no longer any
reason why those slots in the pattern_ops classes can't be filled by the
plain = and <> operators.  (There *was* a reason when they were first
invented --- but now that texteq will only return true for exact bitwise
match, I think it's OK to assume these are equivalent.)

In the meantime, though, I think the only way that Kaare's query can use
that index is if he writesWHERE b LIKE 'whatever' AND b <> '';
(with whatever spelling of <> the index predicate has).  There is not
anything in the predicate proving machinery that knows enough about LIKE
to be able to show that "b LIKE 'whatever'" implies "b <> ''".
        regards, tom lane

Re: Questions about indexes with text_pattern_ops

From

Gregory Stark

Date:

25 February 2008, 12:14:25

"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> Gregory Stark <stark@enterprisedb.com> writes:
>> Hm, for a simple = or <> I think it doesn't matter which operator class you
>> use. For < or > it would produce different answers. Postgres isn't clever enough
>> to notice that this is equivalent though so I think you would have to do
>> something like (untested):
>
>> CREATE INDEX new_index ON a (b text_pattern_ops) WHERE b ~<>~ '';
>
>> That uses the same operator that the LIKE clause will use for the index range.
>
> I'm intending to get rid of ~=~ and ~<>~ for 8.4; there's no longer any
> reason why those slots in the pattern_ops classes can't be filled by the
> plain = and <> operators.  (There *was* a reason when they were first
> invented --- but now that texteq will only return true for exact bitwise
> match, I think it's OK to assume these are equivalent.)

The only question is whether we'll keep that forever. I thought it was a good
idea at the time but I'm starting to wonder about the implications for
multi-key indexes.

> In the meantime, though, I think the only way that Kaare's query can use
> that index is if he writes
>     WHERE b LIKE 'whatever' AND b <> '';
> (with whatever spelling of <> the index predicate has).  There is not
> anything in the predicate proving machinery that knows enough about LIKE
> to be able to show that "b LIKE 'whatever'" implies "b <> ''".

I was thinking that the inequalities that the LIKE index scan introduces would
imply the inequality. I take it we generate those inequalities too late in the
planning process to use them for other planning? 


--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support!

Re: Questions about indexes with text_pattern_ops

From

Tom Lane

Date:

25 February 2008, 12:30:06

Gregory Stark <stark@enterprisedb.com> writes:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>> I'm intending to get rid of ~=~ and ~<>~ for 8.4; there's no longer any
>> reason why those slots in the pattern_ops classes can't be filled by the
>> plain = and <> operators.  (There *was* a reason when they were first
>> invented --- but now that texteq will only return true for exact bitwise
>> match, I think it's OK to assume these are equivalent.)

> The only question is whether we'll keep that forever. I thought it was a good
> idea at the time but I'm starting to wonder about the implications for
> multi-key indexes.

How so?  If you think this change is a bad idea you'd better speak up
PDQ.

>> In the meantime, though, I think the only way that Kaare's query can use
>> that index is if he writes
>> WHERE b LIKE 'whatever' AND b <> '';
>> (with whatever spelling of <> the index predicate has).  There is not
>> anything in the predicate proving machinery that knows enough about LIKE
>> to be able to show that "b LIKE 'whatever'" implies "b <> ''".

> I was thinking that the inequalities that the LIKE index scan introduces would
> imply the inequality. I take it we generate those inequalities too late in the
> planning process to use them for other planning? 

Hmm, good point [ experiments... ]  Yeah, it seems we don't reconsider
partial indexes after those clauses have been generated.  Not sure how
expensive it'd be to change that.
        regards, tom lane

Re: Questions about indexes with text_pattern_ops

From

Gregory Stark

Date:

25 February 2008, 13:10:50

"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> Gregory Stark <stark@enterprisedb.com> writes:
>> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>>> I'm intending to get rid of ~=~ and ~<>~ for 8.4; there's no longer any
>>> reason why those slots in the pattern_ops classes can't be filled by the
>>> plain = and <> operators.  (There *was* a reason when they were first
>>> invented --- but now that texteq will only return true for exact bitwise
>>> match, I think it's OK to assume these are equivalent.)
>
>> The only question is whether we'll keep that forever. I thought it was a good
>> idea at the time but I'm starting to wonder about the implications for
>> multi-key indexes.
>
> How so?  If you think this change is a bad idea you'd better speak up
> PDQ.

Well I think it's fine for 'foo ' != 'foo' even if they sort similarly. 

But I'm not sure it makes sense for <'foo ','a'> to sort after <'foo','b'> if
the locale says that 'foo ' should be compare "equal" to 'foo' and 'a' before
'b'.

I'm starting to think "equality" and "sorts interchangeably" are not the same
operator at all. On the other hand it may not be worth the complexity that
might bring.

>> I was thinking that the inequalities that the LIKE index scan introduces would
>> imply the inequality. I take it we generate those inequalities too late in the
>> planning process to use them for other planning? 
>
> Hmm, good point [ experiments... ]  Yeah, it seems we don't reconsider
> partial indexes after those clauses have been generated.  Not sure how
> expensive it'd be to change that.

Perhaps we should always generate those inequalities even if there's no index
that can use them. Then calculate the regexp selectivity based only on the
regexp after the prefix.

That might also help constraint exclusion. Maybe merge joins?

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication
support!

Re: Questions about indexes with text_pattern_ops

From

Tom Lane

Date:

25 February 2008, 13:40:52

Gregory Stark <stark@enterprisedb.com> writes:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>> How so?  If you think this change is a bad idea you'd better speak up
>> PDQ.

> Well I think it's fine for 'foo ' != 'foo' even if they sort similarly. 

> But I'm not sure it makes sense for <'foo ','a'> to sort after <'foo','b'> if
> the locale says that 'foo ' should be compare "equal" to 'foo' and 'a' before
> 'b'.

I don't think we can concern ourselves with that; it would require
allowing different columns of an index to interact, which would be
impossibly messy.  What's more, it'd destroy the property that a btree
index is sorted by its leading column(s) as well as by all its columns.

> Perhaps we should always generate those inequalities even if there's no index
> that can use them.

Hmmm ... we intentionally don't do that, but the constraint exclusion
code might be a sufficient reason to reconsider.
        regards, tom lane

Re: Questions about indexes with text_pattern_ops

From

Gregory Stark

Date:

25 February 2008, 14:06:46

"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> Gregory Stark <stark@enterprisedb.com> writes:
>> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>>> How so?  If you think this change is a bad idea you'd better speak up
>>> PDQ.
>
>> Well I think it's fine for 'foo ' != 'foo' even if they sort similarly. 
>
>> But I'm not sure it makes sense for <'foo ','a'> to sort after <'foo','b'> if
>> the locale says that 'foo ' should be compare "equal" to 'foo' and 'a' before
>> 'b'.
>
> I don't think we can concern ourselves with that; it would require
> allowing different columns of an index to interact, which would be
> impossibly messy.  What's more, it'd destroy the property that a btree
> index is sorted by its leading column(s) as well as by all its columns.

Well, I was thinking we might have to separate the equal operators from the
btree opclass. Equals would be a stricter property than "sorts as same".

It would be mighty strange to have values which compare unequal but are
neither < or > though. Or which compare equal but also compare < or >.

It might be a little less surprising if we invent a new operator === for
"actually the same" and have == report whether two objects sort as equals. But
I'm not sure our experience with Turkish doesn't show that that will still
surprise people.

It may be more right in an abstract ideal world -- the reality is that text
collation is annoyingly complex. But this may be a case where we can get away
with just eliding this hassle.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Get trained by Bruce Momjian - ask me about
EnterpriseDB'sPostgreSQL training!

Re: Questions about indexes with text_pattern_ops

From

Tom Lane

Date:

25 February 2008, 14:54:38

Gregory Stark <stark@enterprisedb.com> writes:
> It may be more right in an abstract ideal world -- the reality is that text
> collation is annoyingly complex. But this may be a case where we can get away
> with just eliding this hassle.

If anyone actually complains about it, I think we can point to the SQL
spec, which unambiguously says that a multicolumn sort key is considered
one column at a time.
        regards, tom lane