Thread: BUG #17761: Questionable regular expression behavior
The following bug has been logged on the website: Bug reference: 17761 Logged by: Konstantin Geordzhev Email address: kosiodg@yahoo.com PostgreSQL version: 11.10 Operating system: tested online Description: Executing: select regexp_matches('a 1x1250x2500', '(a).*?([1-9]\d*)\s*x\s*([1-9]\d*)(?:\s*x\s*([1-9]\d*))?'); returns: {a,1,1,NULL} while executing: select regexp_matches('a 1x1250x2500', '(a|b).*?([1-9]\d*)\s*x\s*([1-9]\d*)(?:\s*x\s*([1-9]\d*))?'); returns: {a,1,1250,2500} Shouldn't both results be equal? Tested online at: https://extendsclass.com/postgresql-online.html and on ubuntu version 9.5
On Fri, Jan 27, 2023 at 09:27:35AM +0000, PG Bug reporting form wrote: > The following bug has been logged on the website: > > Bug reference: 17761 > Logged by: Konstantin Geordzhev > Email address: kosiodg@yahoo.com > PostgreSQL version: 11.10 > Operating system: tested online > Description: > > Executing: > select regexp_matches('a 1x1250x2500', > '(a).*?([1-9]\d*)\s*x\s*([1-9]\d*)(?:\s*x\s*([1-9]\d*))?'); > returns: {a,1,1,NULL} > while executing: > select regexp_matches('a 1x1250x2500', > '(a|b).*?([1-9]\d*)\s*x\s*([1-9]\d*)(?:\s*x\s*([1-9]\d*))?'); > returns: {a,1,1250,2500} > > Shouldn't both results be equal? The problem is, afair, that there is some state in pg's regexp engine that makes greedy/ungreedy decision once per regexp. I don't recall details, but my take from back when I learned about it (years ago) is to try to avoid things like .*? Instead you can: #v+ $ select regexp_matches('a 1x1250x2500', '(a)\D*([1-9]\d*)\s*x\s*([1-9]\d*)(?:\s*x\s*([1-9]\d*))?'); regexp_matches ───────────────── {a,1,1250,2500} (1 row) #v- depesz
hubert depesz lubaczewski <depesz@depesz.com> writes: > On Fri, Jan 27, 2023 at 09:27:35AM +0000, PG Bug reporting form wrote: >> Executing: >> select regexp_matches('a 1x1250x2500', >> '(a).*?([1-9]\d*)\s*x\s*([1-9]\d*)(?:\s*x\s*([1-9]\d*))?'); >> returns: {a,1,1,NULL} >> while executing: >> select regexp_matches('a 1x1250x2500', >> '(a|b).*?([1-9]\d*)\s*x\s*([1-9]\d*)(?:\s*x\s*([1-9]\d*))?'); >> returns: {a,1,1250,2500} >> >> Shouldn't both results be equal? > The problem is, afair, that there is some state in pg's regexp engine > that makes greedy/ungreedy decision once per regexp. Yeah. Without having traced through it, I'm fairly sure that in the first case, we have "(a)" which has no greediness, then ".*?" which is non-greedy, and then that determines the overall greediness as non-greedy, so it goes for the shortest overall match not the longest. In the second case, "(a|b)" is greedy because anything involving "|" is greedy, so we immediately decide we'll be greedy overall. The fine manual explains how you can force greediness or non-greediness when the engine's default rules for that don't do what you want. regards, tom lane
Yes, greediness seems to be the case,
One other solution I found to make it greedy is to add '.*$' at the end:
'(a).*?([1-9]\d*)\s*x\s*([1-9]\d*)(?:\s*x\s*([1-9]\d*))?.*$'
В петък, 27 януари 2023 г., 18:05:01 ч. Гринуич+2, Tom Lane <tgl@sss.pgh.pa.us> написа:
hubert depesz lubaczewski <depesz@depesz.com> writes:
> On Fri, Jan 27, 2023 at 09:27:35AM +0000, PG Bug reporting form wrote:
>> Executing:
>> select regexp_matches('a 1x1250x2500',
>> '(a).*?([1-9]\d*)\s*x\s*([1-9]\d*)(?:\s*x\s*([1-9]\d*))?');
>> returns: {a,1,1,NULL}
>> while executing:
>> select regexp_matches('a 1x1250x2500',
>> '(a|b).*?([1-9]\d*)\s*x\s*([1-9]\d*)(?:\s*x\s*([1-9]\d*))?');
>> returns: {a,1,1250,2500}
>>
>> Shouldn't both results be equal?
> The problem is, afair, that there is some state in pg's regexp engine
> that makes greedy/ungreedy decision once per regexp.
Yeah. Without having traced through it, I'm fairly sure that in the
first case, we have "(a)" which has no greediness, then ".*?" which
is non-greedy, and then that determines the overall greediness as
non-greedy, so it goes for the shortest overall match not the longest.
In the second case, "(a|b)" is greedy because anything involving "|"
is greedy, so we immediately decide we'll be greedy overall.
The fine manual explains how you can force greediness or non-greediness
when the engine's default rules for that don't do what you want.
regards, tom lane
> On Fri, Jan 27, 2023 at 09:27:35AM +0000, PG Bug reporting form wrote:
>> Executing:
>> select regexp_matches('a 1x1250x2500',
>> '(a).*?([1-9]\d*)\s*x\s*([1-9]\d*)(?:\s*x\s*([1-9]\d*))?');
>> returns: {a,1,1,NULL}
>> while executing:
>> select regexp_matches('a 1x1250x2500',
>> '(a|b).*?([1-9]\d*)\s*x\s*([1-9]\d*)(?:\s*x\s*([1-9]\d*))?');
>> returns: {a,1,1250,2500}
>>
>> Shouldn't both results be equal?
> The problem is, afair, that there is some state in pg's regexp engine
> that makes greedy/ungreedy decision once per regexp.
Yeah. Without having traced through it, I'm fairly sure that in the
first case, we have "(a)" which has no greediness, then ".*?" which
is non-greedy, and then that determines the overall greediness as
non-greedy, so it goes for the shortest overall match not the longest.
In the second case, "(a|b)" is greedy because anything involving "|"
is greedy, so we immediately decide we'll be greedy overall.
The fine manual explains how you can force greediness or non-greediness
when the engine's default rules for that don't do what you want.
regards, tom lane