Thread: strpos() && KMP
Hello, this patch allow to use Knuth-Morrison-Pratt algorithm for strpos() function (see Cormen et al. Introduction to Algorithms, MIT Press, 2001). It also works with multibyte wchar. In worst case current brute force strpos() takes O(n * m) (n && m is length of strings) time (example: 'aaa...aaab' search in 'aaa...aaa'). KMP algo always takes O(n + m) time. To check this someone need to create a table with one text attribute, and insert several thousands record 'aa..aa'(for example, with lenght = 1000) . After execute "select count(*) from test where strpos(a, 'aaa....aab') > 0;" on current and modified version. Also, I advise to use "select .. where strpos(att, 'word') > 0;" instead "select .. where attr like '%word%'" (strpos must be faster than regex). In general, this belongs to artificial expressions. In natural language KMP is equal (execution time) current strpos() nearly. ---- Ajtkulov Pavel ajtkulov@acm.org P. S. Sorry for prime English.
Attachment
Pavel Ajtkulov wrote: > Also, I advise to use "select .. where strpos(att, 'word') > 0;" instead "select .. where attr like '%word%'" > (strpos must be faster than regex). > > The LIKE code does not use the regex engine. See src/backend/utils/adt/like.c and like_match.c - which recently got an efficiency makeover, especially for MB charsets (and most especially for UTF8). cheers andrew
Pavel Ajtkulov <ajtkulov@acm.org> writes: > this patch allow to use Knuth-Morrison-Pratt algorithm for strpos() function > (see Cormen et al. Introduction to Algorithms, MIT Press, 2001). This seems like a lot of complexity added to fix a non-problem. We've had no complaints about the speed of those functions (at least not since we changed the original O(N^2) method :-(). regards, tom lane
On Thu, Aug 02, 2007 at 01:18:22AM +0500, Pavel Ajtkulov wrote: > Hello, > > this patch allow to use Knuth-Morrison-Pratt algorithm for strpos() function > (see Cormen et al. Introduction to Algorithms, MIT Press, 2001). > > It also works with multibyte wchar. > > In worst case current brute force strpos() takes O(n * m) (n && m is length of strings) > time (example: 'aaa...aaab' search in 'aaa...aaa'). > KMP algo always takes O(n + m) time. > To check this someone need to create a table with one text attribute, and insert several thousands > record 'aa..aa'(for example, with lenght = 1000) . After execute "select count(*) from test where > strpos(a, 'aaa....aab') > 0;" on current and modified version. > > Also, I advise to use "select .. where strpos(att, 'word') > 0;" instead "select .. where attr like '%word%'" > (strpos must be faster than regex). > > In general, this belongs to artificial expressions. In natural language KMP is equal (execution time) > current strpos() nearly. Do you have any performance test results for this? -- Decibel!, aka Jim Nasby decibel@decibel.org EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Attachment
> Do you have any performance test results for this? I describe the worst case in first message (search 'aaa..aab' in 'aa..aa', "complete" N^2). It works some msec instead of several sec (in current version). Patch intends for artificial language (for example DNA, or language with small alphabet, or regular language) only. In natural language, KMP(and other search algo) haven't notable advantages (+-5% time execution). ---- Ajtkulov Pavel ajtkulov@acm.org
On Wed, Aug 08, 2007 at 12:19:53AM +0500, Pavel Ajtkulov wrote: > > > Do you have any performance test results for this? > > I describe the worst case in first message (search 'aaa..aab' in > 'aa..aa', "complete" N^2). It works some msec instead of several sec > (in current version). You describe the test case but don't provide any results. Providing the test case so others can duplicate your results is good, but you should provide actual results as well. You should also make sure to test any cases where performance would actually degrade so we can see what that looks like (though I don't know if that's an issue in this case). -- Decibel!, aka Jim Nasby decibel@decibel.org EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Attachment
Pavel Ajtkulov <ajtkulov@acm.org> writes: > Patch intends for artificial language (for example DNA, or > language with small alphabet, or regular language) only. > In natural language, KMP(and other search algo) haven't notable > advantages (+-5% time execution). I wonder why you didn't propose Boyer-Moore instead, as that would have some advantage for natural language text as well. The difficulty with B-M is the need for a table indexed by character code, which at first glance looks impractical for wchars. But it seems to me that we could use "wchar % 256" as the table index, meaning that wchars with the same trailing byte share the same table entry. That would lose some efficiency compared to an exact implementation, but the limited table size would outweigh that except in the most pathological cases. regards, tom lane
Tom Lane writes: > I wonder why you didn't propose Boyer-Moore instead, as that would have > some advantage for natural language text as well. > The difficulty with B-M is the need for a table indexed by character > code, which at first glance looks impractical for wchars. But it seems > to me that we could use "wchar % 256" as the table index, meaning that > wchars with the same trailing byte share the same table entry. That > would lose some efficiency compared to an exact implementation, but the > limited table size would outweigh that except in the most pathological > cases. hash table? The main difficulty with BM is verification and understanding "good suffix shift" (the second part of BM) (I don't understand it entirely). ---- Ajtkulov Pavel ajtkulov@acm.org
Pavel Ajtkulov <ajtkulov@acm.org> writes: > Tom Lane writes: >> The difficulty with B-M is the need for a table indexed by character >> code, which at first glance looks impractical for wchars. But it seems >> to me that we could use "wchar % 256" as the table index, meaning that >> wchars with the same trailing byte share the same table entry. > hash table? I'd think the cost of hashing would render it impractical. Most of the papers I've seen on this topic worry about getting single instructions out of the search loop --- a hash lookup will cost lots more than that. Moreover, you'd lose the guarantee of not-worse-than-linear time, because hash lookup can be pathologically bad if you get a lot of hash collisions. > The main difficulty with BM is verification and understanding "good > suffix shift" (the second part of BM) (I don't understand it entirely). Yeah, there seem to be a bunch of variants of BM (many of them not guaranteed linear, which I'm sure we don't want) and the earliest papers had bugs. But a good implementation would be a lot easier sell because it would show benefits for a much wider set of use-cases than KMP. regards, tom lane
Tom Lane writes: >> hash table? > I'd think the cost of hashing would render it impractical. Most of the > papers I've seen on this topic worry about getting single instructions > out of the search loop --- a hash lookup will cost lots more than that. > Moreover, you'd lose the guarantee of not-worse-than-linear time, > because hash lookup can be pathologically bad if you get a lot of hash > collisions. compute max_wchar, min_wchar. If (d = max_wchar - min_wchar) < k (for example, k = 1000), then we use index table (wchar -> wchar - min_wchar). Else we use hash table. Number of collisions would be a few (because hash table needs for pattern characters only. Characters located serially, hash function = whchar % const). >> The main difficulty with BM is verification and understanding "good >> suffix shift" (the second part of BM) (I don't understand it entirely). > Yeah, there seem to be a bunch of variants of BM (many of them not > guaranteed linear, which I'm sure we don't want) and the earliest > papers had bugs. But a good implementation would be a lot easier > sell because it would show benefits for a much wider set of use-cases > than KMP. Is there requirement for some string mathching algorithms/data structure(suffix array/tree) in PG? or "We've had no complaints about the speed of those functions". ---- Ajtkulov Pavel ajtkulov@acm.org
Pavel Ajtkulov <ajtkulov@acm.org> writes: > Tom Lane writes: >> Moreover, you'd lose the guarantee of not-worse-than-linear time, >> because hash lookup can be pathologically bad if you get a lot of hash >> collisions. > compute max_wchar, min_wchar. If (d = max_wchar - min_wchar) < k (for > example, k = 1000), then we use index table (wchar -> wchar - > min_wchar). Else we use hash table. Number of collisions would be a > few (because hash table needs for pattern characters only. I think you missed my point: there's a significant difference between "guaranteed good performance" and "probabilistically good performance". Even when the probably-good algorithm wins for typical cases, there's a strong argument to be made for guarantees. The problem you set out to solve really is that an algorithm that's all right in everyday cases will suck in certain uncommon cases --- so why do you want to fix it by just moving around the cases in which it fails to do well? regards, tom lane
Added to TODO: > * Implement Boyer-Moore searching in strpos() > > http://archives.postgresql.org/pgsql-patches/2007-08/msg00012.php --------------------------------------------------------------------------- Pavel Ajtkulov wrote: > Hello, > > this patch allow to use Knuth-Morrison-Pratt algorithm for strpos() function > (see Cormen et al. Introduction to Algorithms, MIT Press, 2001). > > It also works with multibyte wchar. > > In worst case current brute force strpos() takes O(n * m) (n && m is length of strings) > time (example: 'aaa...aaab' search in 'aaa...aaa'). > KMP algo always takes O(n + m) time. > To check this someone need to create a table with one text attribute, and insert several thousands > record 'aa..aa'(for example, with lenght = 1000) . After execute "select count(*) from test where > strpos(a, 'aaa....aab') > 0;" on current and modified version. > > Also, I advise to use "select .. where strpos(att, 'word') > 0;" instead "select .. where attr like '%word%'" > (strpos must be faster than regex). > > In general, this belongs to artificial expressions. In natural language KMP is equal (execution time) > current strpos() nearly. > > > ---- > Ajtkulov Pavel > ajtkulov@acm.org > > P. S. Sorry for prime English. [ Attachment, skipping... ] > > ---------------------------(end of broadcast)--------------------------- > TIP 5: don't forget to increase your free space map settings -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +