Thread: Undocumented(?) limits on regexp functions
All the regexp functions blow up with "invalid memory alloc request"
errors when the input string exceeds 256MB in length. This restriction
does not seem to be documented anywhere that I could see.

(Also for regexp_split* and regexp_matches, there's a limit of 64M total
matches, which also doesn't seem to be documented anywhere.)

Should these limits:

a) be removed
b) be documented
c) have better error messages?

-- 
Andrew (irc:RhodiumToad)
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> All the regexp functions blow up with "invalid memory alloc request"
> errors when the input string exceeds 256MB in length. This restriction
> does not seem to be documented anywhere that I could see.

> (Also for regexp_split* and regexp_matches, there's a limit of 64M total
> matches, which also doesn't seem to be documented anywhere).

> Should these limits:
> a) be removed

Doubt it --- we could use the "huge" request variants, maybe, but
I wonder whether the engine could run fast enough that you'd want to.

> c) have better error messages?

+1 for that, though.

			regards, tom lane
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

 >> Should these limits:
 >> a) be removed

 Tom> Doubt it --- we could use the "huge" request variants, maybe, but
 Tom> I wonder whether the engine could run fast enough that you'd want
 Tom> to.

I do wonder (albeit without evidence) whether the quadratic slowdown
problem I posted a patch for earlier was ignored for so long because
people just went "meh, regexps are slow" rather than wondering why a
trivial splitting of a 40kbyte string was taking more than a second.

-- 
Andrew (irc:RhodiumToad)
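The slowdown shape Andrew describes (over a second to split a 40kbyte string) is characteristic of redoing O(n) work for every match. A toy Python sketch of that general pattern follows; this is a hypothetical illustration of the quadratic-vs-linear shape only, not the actual regexp.c code path or the specific bug Andrew's patch addresses:

```python
# Toy model of quadratic vs. linear splitting cost.
# Hypothetical illustration only -- not PostgreSQL's regexp.c.

def split_quadratic(s, sep):
    # After each match, copy the entire remainder of the string and
    # restart: total work is O(n^2) in the string length.
    parts = []
    while True:
        i = s.find(sep)
        if i < 0:
            parts.append(s)
            return parts
        parts.append(s[:i])
        s = s[i + len(sep):]   # copies the whole tail every iteration

def split_linear(s, sep):
    # Track an offset into the original string instead of copying the
    # tail: total work is O(n).
    parts, start = [], 0
    while True:
        i = s.find(sep, start)
        if i < 0:
            parts.append(s[start:])
            return parts
        parts.append(s[start:i])
        start = i + len(sep)

s = "a," * 10000 + "a"
assert split_quadratic(s, ",") == split_linear(s, ",") == s.split(",")
```

Both functions produce identical results; only the first degrades quadratically as the input grows, which is why a "trivial" split can take seconds on inputs of only tens of kilobytes.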
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
> Tom> Doubt it --- we could use the "huge" request variants, maybe, but
> Tom> I wonder whether the engine could run fast enough that you'd want
> Tom> to.

> I do wonder (albeit without evidence) whether the quadratic slowdown
> problem I posted a patch for earlier was ignored for so long because
> people just went "meh, regexps are slow" rather than wondering why a
> trivial splitting of a 40kbyte string was taking more than a second.

I have done performance measurements on the regex stuff in the past,
and not noticed any huge penalty in regexp.c. I was planning to try to
figure out what test case you were using that was different from what
I'd looked at, but have not got round to it yet.

In the light of morning I'm reconsidering my initial thought of not
wanting to use MemoryContextAllocHuge. My reaction was based on
thinking that that would allow people to reach indefinitely large
regexp inputs, but really that's not so; the maximum input length will
be a 1GB text object, hence at most 1G characters. regexp.c needs to
expand that into 4-bytes-each "chr" characters, so it could be at most
4GB of data. The fact that inputs between 256M and 1G characters fail
could be seen as an implementation rough edge that we ought to sand
down, at least on 64-bit platforms.

			regards, tom lane
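Tom's arithmetic can be sketched numerically. A minimal back-of-envelope illustration, with the 1GB ordinary-allocation cap and the 4-byte "chr" width taken from the discussion above (assumptions from this thread, not verified against the source tree):

```python
# Back-of-envelope arithmetic behind the ~256M-character limit discussed
# above. The constants are assumptions taken from the thread: ordinary
# palloc requests are capped at 1GB minus 1 byte, and regexp.c expands
# input into 4-byte wide characters.

MAX_ALLOC_SIZE = 2**30 - 1   # assumed ordinary allocation cap: 1GB - 1
BYTES_PER_CHR = 4            # assumed width of one expanded "chr"

# Largest character count whose wide-char expansion still fits in one
# ordinary allocation -- just under 256M characters:
max_chars = MAX_ALLOC_SIZE // BYTES_PER_CHR
print(max_chars)

# Worst case for a full 1GB text value of single-byte characters:
worst_case_bytes = 2**30 * BYTES_PER_CHR
print(worst_case_bytes // 2**30, "GB")   # 4GB -- hence "huge" requests
```

This matches the observed behavior: inputs over roughly 256M characters exceed the ordinary allocation limit once expanded, while the true worst case (4GB) would fit a "huge" allocation on 64-bit platforms.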
Moin Andrew,

On Tue, August 14, 2018 9:16 am, Andrew Gierth wrote:
>>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
>
> >> Should these limits:
>
> >> a) be removed
>
> Tom> Doubt it --- we could use the "huge" request variants, maybe, but
> Tom> I wonder whether the engine could run fast enough that you'd want
> Tom> to.
>
> I do wonder (albeit without evidence) whether the quadratic slowdown
> problem I posted a patch for earlier was ignored for so long because
> people just went "meh, regexps are slow" rather than wondering why a
> trivial splitting of a 40kbyte string was taking more than a second.

Pretty much this. :)

First of all, thank you for working in this area; it is very welcome.

We do use UTF-8, and we did notice that regexps are not actually the
fastest around, albeit we did not (yet) run into the memory limit.
Mostly that is because the regexp_match* stuff we use appears only in
places where performance is not key and the input/output is small
(albeit, now that I mention it, the quadratic behaviour might explain
a few slowdowns in other cases I need to investigate).

Anyway, in a few places we have functions that use a lot (more than a
dozen) of regexps that are also moderately complex (e.g. spanning
multiple lines). In these cases the performance was not really up to
par, so I experimented and in the end rewrote the functions in plperl,
which fixed the performance, so we no longer had this issue.

All the best,

Tels
> On Aug 14, 2018, at 10:01 AM, Tels <nospam-pg-abuse@bloodgate.com> wrote:
>
> Moin Andrew,
>
> On Tue, August 14, 2018 9:16 am, Andrew Gierth wrote:
>>>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
>>
>>>> Should these limits:
>>
>>>> a) be removed
>>
>> Tom> Doubt it --- we could use the "huge" request variants, maybe, but
>> Tom> I wonder whether the engine could run fast enough that you'd want
>> Tom> to.
>>
>> I do wonder (albeit without evidence) whether the quadratic slowdown
>> problem I posted a patch for earlier was ignored for so long because
>> people just went "meh, regexps are slow" rather than wondering why a
>> trivial splitting of a 40kbyte string was taking more than a second.
>
> Pretty much this. :)
>
> First of all, thank you for working in this area, this is very welcome.
>
> We do use UTF-8 and we did notice that regexp are not actually the fastest
> around, albeit we did not (yet) run into the memory limit. Mostly, because
> the regexp_match* stuff we use is only used in places where the
> performance is not key and the input/output is small (albeit, now that I
> mention it, the quadratic behaviour might explain a few slowdowns in other
> cases I need to investigate).
>
> Anyway, in a few places we have functions that use a lot (> a dozend)
> regexps that are also moderate complex (e.g. span multiple lines). In
> these cases the performance was not really up to par, so I experimented
> and in the end rewrote the functions in plperl. Which fixed the
> performance, so we no longer had this issue.

+1. I have done something similar, though in C rather than plperl.

As for the length limit, I have only hit that in stress testing, not in
practice.

mark