Thread: Questions about indexes with text_pattern_ops
Hi The database is initialized with utf8, so in order for LIKE to use the index on a text field, I used text_pattern_ops when I created it. So far so good. It's in the documentation, but there's no explanation of why this index will only work for LIKE searches. How come that I have to have two different indexes if I want to give Postgres the ability to choose index scan over seq scan on LIKE and non-LIKE searches? Is it a performance issue? Also, when I tried to create the index as a partial one (avoiding the 95% entries with empty strings), Postgresql chooses to use seq scan. This sounds counter intuitive to me. CREATE INDEX new_index ON a (b text_pattern_ops) WHERE b <> ''; This is 8.2.6.
"Kaare Rasmussen" <kaare@jasonic.dk> writes: > Hi > > The database is initialized with utf8, so in order for LIKE to use the index on > a text field, I used text_pattern_ops when I created it. So far so good. > > It's in the documentation, but there's no explanation of why this index will > only work for LIKE searches. How come that I have to have two different indexes > if I want to give Postgres the ability to choose index scan over seq scan on > LIKE and non-LIKE searches? Because in non-C locales (which you're almost certainly using if you're using UTF8) the ordering which the normal text operations use can be quite complex. Just as an example most locales have spaces being entirely insignificant. So no range can reliably match a prefix LIKE pattern. The text_pattern_ops use simple character-by-character ordering which are useful for LIKE but not for regular < and > comparisons. They're just two different orderings. > Also, when I tried to create the index as a partial one (avoiding the 95% > entries with empty strings), Postgresql chooses to use seq scan. This sounds > counter intuitive to me. > > CREATE INDEX new_index ON a (b text_pattern_ops) WHERE b <> ''; > This is 8.2.6. Hm, for a simple = or <> I think it doesn't matter which operator class you use. For < or > it would produce different answers. Postgres isn't clever enough to notice that this is equivalent though so I think you would have to do something like (untested): CREATE INDEX new_index ON a (b text_pattern_ops) WHERE b ~<>~ ''; That uses the same operator that the LIKE clause will use for the index range. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's On-Demand Production Tuning
Gregory Stark <stark@enterprisedb.com> writes: > Hm, for a simple = or <> I think it doesn't matter which operator class you > use. For < or > it would produce different answers. Postgres isn't clever enough > to notice that this is equivalent though so I think you would have to do > something like (untested): > CREATE INDEX new_index ON a (b text_pattern_ops) WHERE b ~<>~ ''; > That uses the same operator that the LIKE clause will use for the index range. I'm intending to get rid of ~=~ and ~<>~ for 8.4; there's no longer any reason why those slots in the pattern_ops classes can't be filled by the plain = and <> operators. (There *was* a reason when they were first invented --- but now that texteq will only return true for exact bitwise match, I think it's OK to assume these are equivalent.) In the meantime, though, I think the only way that Kaare's query can use that index is if he writesWHERE b LIKE 'whatever' AND b <> ''; (with whatever spelling of <> the index predicate has). There is not anything in the predicate proving machinery that knows enough about LIKE to be able to show that "b LIKE 'whatever'" implies "b <> ''". regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > Gregory Stark <stark@enterprisedb.com> writes: >> Hm, for a simple = or <> I think it doesn't matter which operator class you >> use. For < or > it would produce different answers. Postgres isn't clever enough >> to notice that this is equivalent though so I think you would have to do >> something like (untested): > >> CREATE INDEX new_index ON a (b text_pattern_ops) WHERE b ~<>~ ''; > >> That uses the same operator that the LIKE clause will use for the index range. > > I'm intending to get rid of ~=~ and ~<>~ for 8.4; there's no longer any > reason why those slots in the pattern_ops classes can't be filled by the > plain = and <> operators. (There *was* a reason when they were first > invented --- but now that texteq will only return true for exact bitwise > match, I think it's OK to assume these are equivalent.) The only question is whether we'll keep that forever. I thought it was a good idea at the time but I'm starting to wonder about the implications for multi-key indexes. > In the meantime, though, I think the only way that Kaare's query can use > that index is if he writes > WHERE b LIKE 'whatever' AND b <> ''; > (with whatever spelling of <> the index predicate has). There is not > anything in the predicate proving machinery that knows enough about LIKE > to be able to show that "b LIKE 'whatever'" implies "b <> ''". I was thinking that the inequalities that the LIKE index scan introduces would imply the inequality. I take it we generate those inequalities too late in the planning process to use them for other planning? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support!
Gregory Stark <stark@enterprisedb.com> writes: > "Tom Lane" <tgl@sss.pgh.pa.us> writes: >> I'm intending to get rid of ~=~ and ~<>~ for 8.4; there's no longer any >> reason why those slots in the pattern_ops classes can't be filled by the >> plain = and <> operators. (There *was* a reason when they were first >> invented --- but now that texteq will only return true for exact bitwise >> match, I think it's OK to assume these are equivalent.) > The only question is whether we'll keep that forever. I thought it was a good > idea at the time but I'm starting to wonder about the implications for > multi-key indexes. How so? If you think this change is a bad idea you'd better speak up PDQ. >> In the meantime, though, I think the only way that Kaare's query can use >> that index is if he writes >> WHERE b LIKE 'whatever' AND b <> ''; >> (with whatever spelling of <> the index predicate has). There is not >> anything in the predicate proving machinery that knows enough about LIKE >> to be able to show that "b LIKE 'whatever'" implies "b <> ''". > I was thinking that the inequalities that the LIKE index scan introduces would > imply the inequality. I take it we generate those inequalities too late in the > planning process to use them for other planning? Hmm, good point [ experiments... ] Yeah, it seems we don't reconsider partial indexes after those clauses have been generated. Not sure how expensive it'd be to change that. regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > Gregory Stark <stark@enterprisedb.com> writes: >> "Tom Lane" <tgl@sss.pgh.pa.us> writes: >>> I'm intending to get rid of ~=~ and ~<>~ for 8.4; there's no longer any >>> reason why those slots in the pattern_ops classes can't be filled by the >>> plain = and <> operators. (There *was* a reason when they were first >>> invented --- but now that texteq will only return true for exact bitwise >>> match, I think it's OK to assume these are equivalent.) > >> The only question is whether we'll keep that forever. I thought it was a good >> idea at the time but I'm starting to wonder about the implications for >> multi-key indexes. > > How so? If you think this change is a bad idea you'd better speak up > PDQ. Well I think it's fine for 'foo ' != 'foo' even if they sort similarly. But I'm not sure it makes sense for <'foo ','a'> to sort after <'foo','b'> if the locale says that 'foo ' should be compare "equal" to 'foo' and 'a' before 'b'. I'm starting to think "equality" and "sorts interchangeably" are not the same operator at all. On the other hand it may not be worth the complexity that might bring. >> I was thinking that the inequalities that the LIKE index scan introduces would >> imply the inequality. I take it we generate those inequalities too late in the >> planning process to use them for other planning? > > Hmm, good point [ experiments... ] Yeah, it seems we don't reconsider > partial indexes after those clauses have been generated. Not sure how > expensive it'd be to change that. Perhaps we should always generate those inequalities even if there's no index that can use them. Then calculate the regexp selectivity based only on the regexp after the prefix. That might also help constraint exclusion. Maybe merge joins? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication support!
Gregory Stark <stark@enterprisedb.com> writes: > "Tom Lane" <tgl@sss.pgh.pa.us> writes: >> How so? If you think this change is a bad idea you'd better speak up >> PDQ. > Well I think it's fine for 'foo ' != 'foo' even if they sort similarly. > But I'm not sure it makes sense for <'foo ','a'> to sort after <'foo','b'> if > the locale says that 'foo ' should be compare "equal" to 'foo' and 'a' before > 'b'. I don't think we can concern ourselves with that; it would require allowing different columns of an index to interact, which would be impossibly messy. What's more, it'd destroy the property that a btree index is sorted by its leading column(s) as well as by all its columns. > Perhaps we should always generate those inequalities even if there's no index > that can use them. Hmmm ... we intentionally don't do that, but the constraint exclusion code might be a sufficient reason to reconsider. regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > Gregory Stark <stark@enterprisedb.com> writes: >> "Tom Lane" <tgl@sss.pgh.pa.us> writes: >>> How so? If you think this change is a bad idea you'd better speak up >>> PDQ. > >> Well I think it's fine for 'foo ' != 'foo' even if they sort similarly. > >> But I'm not sure it makes sense for <'foo ','a'> to sort after <'foo','b'> if >> the locale says that 'foo ' should be compare "equal" to 'foo' and 'a' before >> 'b'. > > I don't think we can concern ourselves with that; it would require > allowing different columns of an index to interact, which would be > impossibly messy. What's more, it'd destroy the property that a btree > index is sorted by its leading column(s) as well as by all its columns. Well, I was thinking we might have to separate the equal operators from the btree opclass. Equals would be a stricter property than "sorts as same". It would be mighty strange to have values which compare unequal but are neither < or > though. Or which compare equal but also compare < or >. It might be a little less surprising if we invent a new operator === for "actually the same" and have == report whether two objects sort as equals. But I'm not sure our experience with Turkish doesn't show that that will still surprise people. It may be more right in an abstract ideal world -- the reality is that text collation is annoyingly complex. But this may be a case where we can get away with just eliding this hassle. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Get trained by Bruce Momjian - ask me about EnterpriseDB'sPostgreSQL training!
Gregory Stark <stark@enterprisedb.com> writes: > It may be more right in an abstract ideal world -- the reality is that text > collation is annoyingly complex. But this may be a case where we can get away > with just eliding this hassle. If anyone actually complains about it, I think we can point to the SQL spec, which unambiguously says that a multicolumn sort key is considered one column at a time. regards, tom lane