Thread: Undocumented(?) limits on regexp functions
All the regexp functions blow up with "invalid memory alloc request"
errors when the input string exceeds 256MB in length. This restriction
does not seem to be documented anywhere that I could see.

(Also for regexp_split* and regexp_matches, there's a limit of 64M total
matches, which also doesn't seem to be documented anywhere.)

Should these limits:

a) be removed
b) be documented
c) have better error messages?

-- 
Andrew (irc:RhodiumToad)
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> All the regexp functions blow up with "invalid memory alloc request"
> errors when the input string exceeds 256MB in length. This restriction
> does not seem to be documented anywhere that I could see.

> (Also for regexp_split* and regexp_matches, there's a limit of 64M total
> matches, which also doesn't seem to be documented anywhere).

> Should these limits:
> a) be removed

Doubt it --- we could use the "huge" request variants, maybe, but
I wonder whether the engine could run fast enough that you'd want to.

> c) have better error messages?

+1 for that, though.

			regards, tom lane
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

 >> Should these limits:
 >> a) be removed

 Tom> Doubt it --- we could use the "huge" request variants, maybe, but
 Tom> I wonder whether the engine could run fast enough that you'd want
 Tom> to.

I do wonder (albeit without evidence) whether the quadratic slowdown
problem I posted a patch for earlier was ignored for so long because
people just went "meh, regexps are slow" rather than wondering why a
trivial splitting of a 40kbyte string was taking more than a second.

-- 
Andrew (irc:RhodiumToad)
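The slowdown shape Andrew describes (over a second to split a 40kbyte string) is characteristic of redoing O(n) work for every match. A toy Python sketch of that general pattern follows; this is a hypothetical illustration of the quadratic-vs-linear shape only, not the actual regexp.c code path or the specific bug Andrew's patch addresses:

```python
# Toy model of quadratic vs. linear splitting cost.
# Hypothetical illustration only -- not PostgreSQL's regexp.c.

def split_quadratic(s, sep):
    # After each match, copy the entire remainder of the string and
    # restart: total work is O(n^2) in the string length.
    parts = []
    while True:
        i = s.find(sep)
        if i < 0:
            parts.append(s)
            return parts
        parts.append(s[:i])
        s = s[i + len(sep):]   # copies the whole tail every iteration

def split_linear(s, sep):
    # Track an offset into the original string instead of copying the
    # tail: total work is O(n).
    parts, start = [], 0
    while True:
        i = s.find(sep, start)
        if i < 0:
            parts.append(s[start:])
            return parts
        parts.append(s[start:i])
        start = i + len(sep)

s = "a," * 10000 + "a"
assert split_quadratic(s, ",") == split_linear(s, ",") == s.split(",")
```

Both functions produce identical results; only the first degrades quadratically as the input grows, which is why a "trivial" split can take seconds on inputs of only tens of kilobytes.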
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
> Tom> Doubt it --- we could use the "huge" request variants, maybe, but
> Tom> I wonder whether the engine could run fast enough that you'd want
> Tom> to.

> I do wonder (albeit without evidence) whether the quadratic slowdown
> problem I posted a patch for earlier was ignored for so long because
> people just went "meh, regexps are slow" rather than wondering why a
> trivial splitting of a 40kbyte string was taking more than a second.

I have done performance measurements on the regex stuff in the past,
and not noticed any huge penalty in regexp.c. I was planning to try to
figure out what test case you were using that was different from what
I'd looked at, but have not got round to it yet.

In the light of morning I'm reconsidering my initial thought of not
wanting to use MemoryContextAllocHuge. My reaction was based on
thinking that that would allow people to reach indefinitely large
regexp inputs, but really that's not so; the maximum input length will
be a 1GB text object, hence at most 1G characters. regexp.c needs to
expand that into 4-bytes-each "chr" characters, so it could be at most
4GB of data. The fact that inputs between 256M and 1G characters fail
could be seen as an implementation rough edge that we ought to sand
down, at least on 64-bit platforms.

			regards, tom lane
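Tom's arithmetic can be sketched numerically. A minimal back-of-envelope illustration, with the 1GB ordinary-allocation cap and the 4-byte "chr" width taken from the discussion above (assumptions from this thread, not verified against the source tree):

```python
# Back-of-envelope arithmetic behind the ~256M-character limit discussed
# above. The constants are assumptions taken from the thread: ordinary
# palloc requests are capped at 1GB minus 1 byte, and regexp.c expands
# input into 4-byte wide characters.

MAX_ALLOC_SIZE = 2**30 - 1   # assumed ordinary allocation cap: 1GB - 1
BYTES_PER_CHR = 4            # assumed width of one expanded "chr"

# Largest character count whose wide-char expansion still fits in one
# ordinary allocation -- just under 256M characters:
max_chars = MAX_ALLOC_SIZE // BYTES_PER_CHR
print(max_chars)

# Worst case for a full 1GB text value of single-byte characters:
worst_case_bytes = 2**30 * BYTES_PER_CHR
print(worst_case_bytes // 2**30, "GB")   # 4GB -- hence "huge" requests
```

This matches the observed behavior: inputs over roughly 256M characters exceed the ordinary allocation limit once expanded, while the true worst case (4GB) would fit a "huge" allocation on 64-bit platforms.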
Moin Andrew,

On Tue, August 14, 2018 9:16 am, Andrew Gierth wrote:
>>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
>
> >> Should these limits:
>
> >> a) be removed
>
> Tom> Doubt it --- we could use the "huge" request variants, maybe, but
> Tom> I wonder whether the engine could run fast enough that you'd want
> Tom> to.
>
> I do wonder (albeit without evidence) whether the quadratic slowdown
> problem I posted a patch for earlier was ignored for so long because
> people just went "meh, regexps are slow" rather than wondering why a
> trivial splitting of a 40kbyte string was taking more than a second.

Pretty much this. :)

First of all, thank you for working in this area; it is very welcome.

We do use UTF-8, and we did notice that regexps are not actually the
fastest around, albeit we did not (yet) run into the memory limit.
Mostly that is because the regexp_match* stuff we use appears only in
places where performance is not key and the input/output is small
(albeit, now that I mention it, the quadratic behaviour might explain
a few slowdowns in other cases I need to investigate).

Anyway, in a few places we have functions that use a lot (more than a
dozen) of regexps that are also moderately complex (e.g. spanning
multiple lines). In these cases the performance was not really up to
par, so I experimented and in the end rewrote the functions in plperl,
which fixed the performance, so we no longer had this issue.

All the best,

Tels
> On Aug 14, 2018, at 10:01 AM, Tels <nospam-pg-abuse@bloodgate.com> wrote:
>
> Moin Andrew,
>
> On Tue, August 14, 2018 9:16 am, Andrew Gierth wrote:
>>>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
>>
>>>> Should these limits:
>>
>>>> a) be removed
>>
>> Tom> Doubt it --- we could use the "huge" request variants, maybe, but
>> Tom> I wonder whether the engine could run fast enough that you'd want
>> Tom> to.
>>
>> I do wonder (albeit without evidence) whether the quadratic slowdown
>> problem I posted a patch for earlier was ignored for so long because
>> people just went "meh, regexps are slow" rather than wondering why a
>> trivial splitting of a 40kbyte string was taking more than a second.
>
> Pretty much this. :)
>
> First of all, thank you for working in this area, this is very welcome.
>
> We do use UTF-8 and we did notice that regexp are not actually the fastest
> around, albeit we did not (yet) run into the memory limit. Mostly, because
> the regexp_match* stuff we use is only used in places where the
> performance is not key and the input/output is small (albeit, now that I
> mention it, the quadratic behaviour might explain a few slowdowns in other
> cases I need to investigate).
>
> Anyway, in a few places we have functions that use a lot (> a dozend)
> regexps that are also moderate complex (e.g. span multiple lines). In
> these cases the performance was not really up to par, so I experimented
> and in the end rewrote the functions in plperl. Which fixed the
> performance, so we no longer had this issue.

+1. I have done something similar, though in C rather than plperl.

As for the length limit, I have only hit that in stress testing, not in
practice.

mark