Thread: 9.5.3: substring: regex greedy operator not picking up chars as expected
9.5.3: substring: regex greedy operator not picking up chars as expected
From
"Foster, Russell"
Date:
Hello, For the following query: select substring('>772' from '.*?[0-9]+') I would expect the output to be '>772', but it is '>7'. You can also see t= he expected result on https://regex101.com/, although I am aware not all re= gex processors work the same. The following queries: select substring('>772' from '^.*?[0-9]+$') and: select substring('>772' from '[0-9]+') both return '>772', which is expected. Could the less greedy operator on t= he left (.*?) be affecting the more greedy right one (+)? Thanks, Russell Foster
Re: 9.5.3: substring: regex greedy operator not picking up chars as expected
From
"David G. Johnston"
Date:
=E2=80=8BWorking as documented.=E2=80=8B https://www.postgresql.org/docs/9.5/static/functions-matching.html#POSIX-MA= TCHING-RULES Specifically, this implementation considers greediness at a level higher than just the atom/expression - and in a mixed "branch" if there is a non-greedy quantifier in a branch the entire branch is non-greedy and can in many situations cause greedy atoms to behave non-greedily. In might help to consider that there aren't really any explicit "greedy" operators like other engines have (i.e., ??, ?, ?+) but rather non-greedy (lazy) and default. The default inherits the non-greedy trait from its parent if applicable otherwise is behaves greedily. On Mon, Aug 15, 2016 at 7:53 AM, Foster, Russell <Russell.Foster@crl.com> wrote: > Hello, > > > > For the following query: > > > > select substring('>772' from '.*?[0-9]+') > =E2=80=8BThe pattern itself is non-greedy=E2=80=8B due to their only being = a single branch and it having a non-greedy quantifier within it. .*? matches ">" and [0-9]+ only needs a single character to generate a non-greedy match conforming match > > I would expect the output to be =E2=80=98>772=E2=80=99, but it is =E2=80= =98>7=E2=80=99. You can also see > the expected result on https://regex101.com/, although I am aware not all > regex processors work the same. > > > > The following queries: > > > > select substring('>772' from '^.*?[0-9]+$') > =E2=80=8BThis is treated exactly the same as the above but because of the ^= $ the shortest possible output string is the entire string=E2=80=8B > > and: > > > > select substring('>772' from '[0-9]+') > > > > both return =E2=80=98>772=E2=80=99, which is expected. Could the less gr= eedy operator on > the left (.*?) be affecting the more greedy right one (+)? > > > Typo here? I'm not fluent with substring(regex). Anyway, the entire RE (single branch) is now greedy so the greedy [0-9]+ atom matches as many numbers as possible. David J.
Re: 9.5.3: substring: regex greedy operator not picking up chars as expected
From
"Foster, Russell"
Date:
SGkgRGF2aWQsDQoNCk11c3QgaGF2ZSBtaXNzZWQgdGhhdCBpbiB0aGUgbWFudWFsLCBidXQgbWFr ZXMgc2Vuc2Ugbm93LiAgU29tZXdoYXQgc3RyYW5nZSBiZWhhdmlvciB0aGF0IGEgbm9uLWdyZWVk eSBxdWFudGlmaWVyIGJhc2ljYWxseSBydWlucyB0aGUgcmVzdCBvZiB0aGUgZXhwcmVzc2lvbiBm b3IgdGhlIGdyZWVkeSBvbmVzLCBidXQgYXQgbGVhc3QgaXTigJlzIHdvcmtpbmcgYXMgZGVzaWdu ZWQuICBUaGFua3MhDQoNClJ1c3NlbGwNCg0KRnJvbTogRGF2aWQgRy4gSm9obnN0b24gW21haWx0 bzpkYXZpZC5nLmpvaG5zdG9uQGdtYWlsLmNvbV0NClNlbnQ6IDE1IEF1Z3VzdCAyMDE2IDg6NDUg QU0NClRvOiBGb3N0ZXIsIFJ1c3NlbGwgPFJ1c3NlbGwuRm9zdGVyQGNybC5jb20+DQpDYzogcGdz cWwtYnVnc0Bwb3N0Z3Jlc3FsLm9yZw0KU3ViamVjdDogUmU6IFtCVUdTXSA5LjUuMzogc3Vic3Ry aW5nOiByZWdleCBncmVlZHkgb3BlcmF0b3Igbm90IHBpY2tpbmcgdXAgY2hhcnMgYXMgZXhwZWN0 ZWQNCg0K4oCLV29ya2luZyBhcyBkb2N1bWVudGVkLuKAiw0KDQpodHRwczovL3d3dy5wb3N0Z3Jl c3FsLm9yZy9kb2NzLzkuNS9zdGF0aWMvZnVuY3Rpb25zLW1hdGNoaW5nLmh0bWwjUE9TSVgtTUFU Q0hJTkctUlVMRVM8aHR0cHM6Ly9uYTAxLnNhZmVsaW5rcy5wcm90ZWN0aW9uLm91dGxvb2suY29t Lz91cmw9aHR0cHMlM2ElMmYlMmZ3d3cucG9zdGdyZXNxbC5vcmclMmZkb2NzJTJmOS41JTJmc3Rh dGljJTJmZnVuY3Rpb25zLW1hdGNoaW5nLmh0bWwlMjNQT1NJWC1NQVRDSElORy1SVUxFUyZkYXRh PTAxJTdjMDElN2NSdXNzZWxsLkZvc3RlciU0MGNybC5jb20lN2NjMWU3MTM1OWVhOGM0YWEwYTQw MDA4ZDNjNTA5ZjdjZiU3YzM3NGY4OTMwZTE1MDQwMzFiYjM1NDgzMjE1ZmU1OTAwJTdjMCZzZGF0 YT1uNEZtV1ppMCUyZiUyYmRnWjVLclkzQmZrMU8wbnBiVkdLJTJiUkNIV25OTU1tWFZvJTNkPg0K DQpTcGVjaWZpY2FsbHksIHRoaXMgaW1wbGVtZW50YXRpb24gY29uc2lkZXJzIGdyZWVkaW5lc3Mg YXQgYSBsZXZlbCBoaWdoZXIgdGhhbiBqdXN0IHRoZSBhdG9tL2V4cHJlc3Npb24gLSBhbmQgaW4g YSBtaXhlZCAiYnJhbmNoIiBpZiB0aGVyZSBpcyBhIG5vbi1ncmVlZHkgcXVhbnRpZmllciBpbiBh IGJyYW5jaCB0aGUgZW50aXJlIGJyYW5jaCBpcyBub24tZ3JlZWR5IGFuZCBjYW4gaW4gbWFueSBz aXR1YXRpb25zIGNhdXNlIGdyZWVkeSBhdG9tcyB0byBiZWhhdmUgbm9uLWdyZWVkaWx5Lg0KDQpJ biBtaWdodCBoZWxwIHRvIGNvbnNpZGVyIHRoYXQgdGhlcmUgYXJlbid0IHJlYWxseSBhbnkgZXhw bGljaXQgImdyZWVkeSIgb3BlcmF0b3JzIGxpa2Ugb3RoZXIgZW5naW5lcyBoYXZlIChpLmUuLCA/ PywgPywgPyspIGJ1dCByYXRoZXIgbm9uLWdyZWVkeSAobGF6eSkgYW5kIGRlZmF1bHQuICBUaGUg ZGVmYXVsdCBpbmhlcml0cyB0aGUgbm9uLWdyZWVkeSB0cmFpdCBmcm9tIGl0cyBwYXJlbnQgaWYg YXBwbGljYWJsZSBvdGhlcndpc2UgaXMgYmVoYXZlcyBncmVlZGlseS4NCg0KT24gTW9uLCBBdWcg MTUsIDIwMTYgYXQgNzo1MyBBTSwgRm9zdGVyLCBSdXNzZWxsIDxSdXNzZWxsLkZvc3RlckBjcmwu Y29tPG1haWx0bzpSdXNzZWxsLkZvc3RlckBjcmwuY29tPj4gd3JvdGU6DQpIZWxsbywNCg0KRm9y IHRoZSBmb2xsb3dpbmcgcXVlcnk6DQoNCnNlbGVjdCBzdWJzdHJpbmcoJz43NzInIGZyb20gJy4q P1swLTldKycpDQoNCuKAi1RoZSBwYXR0ZXJuIGl0c2VsZiBpcyBub24tZ3JlZWR54oCLIGR1ZSB0 byB0aGVpciBvbmx5IGJlaW5nIGEgc2luZ2xlIGJyYW5jaCBhbmQgaXQgaGF2aW5nIGEgbm9uLWdy ZWVkeSBxdWFudGlmaWVyIHdpdGhpbiBpdC4NCg0KLio/IG1hdGNoZXMgIj4iIGFuZCBbMC05XSsg b25seSBuZWVkcyBhIHNpbmdsZSBjaGFyYWN0ZXIgdG8gZ2VuZXJhdGUgYSBub24tZ3JlZWR5IG1h dGNoIGNvbmZvcm1pbmcgbWF0Y2gNCg0KDQpJIHdvdWxkIGV4cGVjdCB0aGUgb3V0cHV0IHRvIGJl IOKAmD43NzLigJksIGJ1dCBpdCBpcyDigJg+N+KAmS4gIFlvdSBjYW4gYWxzbyBzZWUgdGhlIGV4 cGVjdGVkIHJlc3VsdCBvbiBodHRwczovL3JlZ2V4MTAxLmNvbS88aHR0cHM6Ly9uYTAxLnNhZmVs aW5rcy5wcm90ZWN0aW9uLm91dGxvb2suY29tLz91cmw9aHR0cHMlM2ElMmYlMmZyZWdleDEwMS5j b20lMmYmZGF0YT0wMSU3YzAxJTdjUnVzc2VsbC5Gb3N0ZXIlNDBjcmwuY29tJTdjYzFlNzEzNTll YThjNGFhMGE0MDAwOGQzYzUwOWY3Y2YlN2MzNzRmODkzMGUxNTA0MDMxYmIzNTQ4MzIxNWZlNTkw MCU3YzAmc2RhdGE9eWU1NVRkUHhHT0I2TlVvRG44NWwlMmZFZzhvOU1nWVBrYk92JTJiZzRtR2FY dzQlM2Q+LCBhbHRob3VnaCBJIGFtIGF3YXJlIG5vdCBhbGwgcmVnZXggcHJvY2Vzc29ycyB3b3Jr IHRoZSBzYW1lLg0KDQpUaGUgZm9sbG93aW5nIHF1ZXJpZXM6DQoNCnNlbGVjdCBzdWJzdHJpbmco Jz43NzInIGZyb20gJ14uKj9bMC05XSskJykNCg0K4oCLVGhpcyBpcyB0cmVhdGVkIGV4YWN0bHkg dGhlIHNhbWUgYXMgdGhlIGFib3ZlIGJ1dCBiZWNhdXNlIG9mIHRoZSBeJCB0aGUgc2hvcnRlc3Qg cG9zc2libGUgb3V0cHV0IHN0cmluZyBpcyB0aGUgZW50aXJlIHN0cmluZ+KAiw0KDQoNCmFuZDoN Cg0Kc2VsZWN0IHN1YnN0cmluZygnPjc3MicgZnJvbSAnWzAtOV0rJykNCg0KYm90aCByZXR1cm4g 4oCYPjc3MuKAmSwgd2hpY2ggaXMgZXhwZWN0ZWQuICBDb3VsZCB0aGUgbGVzcyBncmVlZHkgb3Bl cmF0b3Igb24gdGhlIGxlZnQgKC4qPykgYmUgYWZmZWN0aW5nIHRoZSBtb3JlIGdyZWVkeSByaWdo dCBvbmUgKCspPw0KDQoNClR5cG8gaGVyZT8gSSdtIG5vdCBmbHVlbnQgd2l0aCBzdWJzdHJpbmco cmVnZXgpLg0KDQpBbnl3YXksIHRoZSBlbnRpcmUgUkUgKHNpbmdsZSBicmFuY2gpIGlzIG5vdyBn cmVlZHkgc28gdGhlIGdyZWVkeSBbMC05XSsgYXRvbSBtYXRjaGVzIGFzIG1hbnkgbnVtYmVycyBh cyBwb3NzaWJsZS4NCg0KRGF2aWQgSi4NCg0K
"Foster, Russell" <Russell.Foster@crl.com> writes: > For the following query: > select substring('>772' from '.*?[0-9]+') > I would expect the output to be '>772', but it is '>7'. As David pointed out, that's what you get because the RE as a whole is considered to be non-greedy, ie you get the shortest overall match. However, you can adjust that by decorating the RE: # select substring('>772' from '(.*?[0-9]+){1,1}'); substring ----------- >772 (1 row) Now it's longest-overall, but the .*? part is still shortest-match, so it doesn't consume any digits. However, I suspect that still is not quite what you want, because it consumes too much in cases like: # select substring('>772foo444' from '(.*?[0-9]+){1,1}'); substring ------------ >772foo444 (1 row) There's probably really no way out of that except to be less lazy about writing the pattern: # select substring('>772foo444' from '([^0-9]*?[0-9]+){1,1}'); substring ----------- >772 (1 row) and in that formulation, of course, greediness doesn't really matter because there is only one way to match. # select substring('>772foo444' from '[^0-9]*[0-9]+'); substring ----------- >772 (1 row) See https://www.postgresql.org/docs/9.5/static/functions-matching.html#POSIX-MATCHING-RULES regards, tom lane