Bug in index scans with Locale support enabled - Mailing list pgsql-hackers
From | Barry Lind |
---|---|
Subject | Bug in index scans with Locale support enabled |
Date | |
Msg-id | 3A31830F.964982C0@xythos.com Whole thread Raw |
Responses |
Re: Bug in index scans with Locale support enabled
Re: Bug in index scans with Locale support enabled |
List | pgsql-hackers |
In researching a problem I have uncovered the following bug in index scans when Locale support is enabled. Given a 7.0.3 postgres installation built with Locale support enabled and a default US RedHat 7.0 Linux installation (meaning that the LANG environment variable is set to en_US) to enable the US english locale and Given the following table and index structure with the following data: create table test (test_col text); create index test_index on test (test_col); insert into test values ('abc.xyz'); insert into test values ('abcxyz'); insert into test values ('abc/xyz'); If you run the query: select * from test where test_col >= 'abc.'; One would normally expect to only get one record returned, but instead all records are returned. The reason for this is that in the en_US locale all non-alphanumeric characters are ignored when doing string comparisons. So the data above gets treated as: abc.xyz = abcxyz = abc/xyz (as the non-alphanumeric characters of '.' and '/' are ignored). This implys that the above query will then return all rows as the constant 'abc.' is the same as 'abc' for comparison purposes and all rows are >= 'abc'. Note that if you use a different locale for example en_UK, you will get different results as this locale does not ignore the . and / in the comparison. Now the real problem comes in when either the like or regex operators are used in a sql statement. Consider the following sql: select * from text where test_col like 'abc/%'; This query should return one row, the row for 'abc/xyz'. However if the above query is executed via an index scan it will return the wrong number of rows (0 in this case). Why is this? Well the query plan created for the above like expression looks like the following: select * from text where test_col >= 'abc/' and test_col < 'abc0'; In order to use the index the like has been changed into a '>=' and a '<' for the constant prefix ('abc/') and the constant prefix with the last character incremented by one ('/abc0') (0 is the next character after / in ASCII). Given what was shown above about how the en_US locale does comparisons we know that the non-alphanumeric characters are ignored. So the query essentially becomes: select * from text where test_col >= 'abc' and test_col < 'abc0'; and the data it is comparing against is 'abcxyz' in all cases (once the .'s an /'s are removed). Therefore since 'abcxyz' > 'abc0', no rows are returned. Over the last couple of months that I have been on the postgres mail lists there have been a few people who reported that queries of the form "like '/aaa/bbb/%' don't work. From the above information I have determined that such queries don't work if: a) database is built with Locale support enabled (--enable-locale) b) the database is running with locale en_US c) the column the like is being performed on is indexed d) the query execution plan uses the above index (Discovering the exact set of circumstances for how to reproduce this has driven me crazy for a while now). The current implementation for converting the like into an index scan doesn't work with Locale support enabled and the en_US locale as shown above. thanks, --Barry PS. my test case: drop table test; create table test (test_col text); create index test_index on test (test_col); insert into test values ('abc.xyz'); insert into test values ('abcxyz'); insert into test values ('abc/xyz'); explain select * from test where test_col like 'abc/%'; select * from test where test_col like 'abc/%'; when run against postgres 7.0.3 with locale support enabled (used the standard RPMs on postgresql.org for RedHat) with LANG=en_US: barry=# drop table test; DROP barry=# create table test (test_col text); CREATE barry=# create index test_index on test (test_col); CREATE barry=# insert into test values ('abc.xyz'); INSERT 227611 1 barry=# insert into test values ('abcxyz'); INSERT 227612 1 barry=# insert into test values ('abc/xyz'); INSERT 227613 1 barry=# explain select * from test where test_col like 'abc/%'; NOTICE: QUERY PLAN: Index Scan using test_index on test (cost=0.00..8.14 rows=10 width=12) EXPLAIN barry=# select * from test where test_col like 'abc/%'; test_col ---------- (0 rows) barry=# when run against postgres 7.0.3 with locale support enabled (used the standard RPMs on postgresql.org) with LANG=en_UK: barry=# drop table test; DROP barry=# create table test (test_col text); CREATE barry=# create index test_index on test (test_col); CREATE barry=# insert into test values ('abc.xyz'); INSERT 227628 1 barry=# insert into test values ('abcxyz'); INSERT 227629 1 barry=# insert into test values ('abc/xyz'); INSERT 227630 1 barry=# explain select * from test where test_col like 'abc/%'; NOTICE: QUERY PLAN: Index Scan using test_index on test (cost=0.00..8.14 rows=10 width=12) EXPLAIN barry=# select * from test where test_col like 'abc/%'; test_col ---------- abc/xyz (1 row) barry=# Note the second query (under en_UK) returns the correct rows, but the first query (under en_US) returned the wrong number of rows. PPS. Another way to work around the problem is to turn off locale specific collation using the environment variable LC_COLLATE and setting it to the value C.
pgsql-hackers by date: