Re: [GENERAL] indexed regex select optimisation missing? - Mailing list pgsql-general
From | Ross J. Reedstrom |
---|---|
Subject | Re: [GENERAL] indexed regex select optimisation missing? |
Date | |
Msg-id | 19991104160621.B11924@wallace.ece.rice.edu Whole thread Raw |
In response to | Re: [GENERAL] indexed regex select optimisation missing? (Stuart Woolford <stuartw@newmail.net>) |
Responses |
Re: [GENERAL] indexed regex select optimisation missing?
Re: [GENERAL] indexed regex select optimisation missing? |
List | pgsql-general |
Ah, your description just tripped a memory for me from the hackers list: The behavior you describe has to do with the implementation of using an index for regex matching, in the presence of the USE_LOCALE configuration option. Internally, the condition: WHERE word~'^alongword' is converted in the parser(!) to: WHERE word >= 'alongword' AND word < 'alongword\377' since the index needs inequalities to be used, not matches. Now, the problem is the hack of tacking an octal \377 on the string to create the lexagraphically 'just bigger' value assumes ASCI sort order. If USE_LOCALE is defined, this is dropped, since we don't have a good fix yet, and slow correct behavior is better than fast, incorrect behavior. So, you have two options: if you don't need locale support, recompile without it. Otherwise, hand code your anchored matches as the pair of conditionals above Hmm, is there syntax for adding an arbitrary value to a string constant in the SQL? I suppose you could use: word < 'alongwore', i.e. hand increment the last character, so it's larger than any match. Your point is correct, the developers are aware of it as a theoretical problem, at least. Always helps to hear a real world case, though. I believe it's on the TODO list as is, otherwise, pester Bruce. ;-) Reviewing my email logs from June, most of the work on this has to do with people who needs locales, and potentially multibyte character sets. Tom Lane is of the opinion that this particular optimization needs to be moved out of the parser, and deeper into the planner or optimizer/rewriter, so a good fix may be some ways out. Ross On Fri, Nov 05, 1999 at 10:12:06AM +1300, Stuart Woolford wrote: > > My point is that, while the index (in 6.5.1 and 6.5.2, anyway) is used to locate > the start of the scan, the system is then index-scanning the *whole* rest of the > table (which takes minutes for my 1.6 million entry table if it is from near > the start), as opposed to using a better 'stop term' to stop scanning once the > regex will no longer be able to match (ie: the static front of the regex is no > longer matching), so the ordered scan is only being half utilised, this makes a > MASSIVE difference in performance. > > For example, say one of the words in the table is 'alongword', and there is > also 'alongwords', but no other words with the root of 'alongword' > [...] > > If I do a 'select key from inv_word_i where word~'^alongword' it uses the > index to find 'alongword', then does an index scan of the *whole* rest of the > table check all the rest of the entries for regex matching, so it takes a long > time, and returns the two entries detailed above, it will take almost as long > as the previous query. > > What it should do is stop as soon as the leftmost part of the regex match no > longer matches 'alongword' because, as it is scanning in indexed order, a match > is no longer possible. The query will then run at nearly the speed of the first -- Ross J. Reedstrom, Ph.D., <reedstrm@rice.edu> NSBRI Research Scientist/Programmer Computer and Information Technology Institute Rice University, 6100 S. Main St., Houston, TX 77005
pgsql-general by date: