Thread: REGEXP_REPLACE woes
I want to transform the text '[p=1242|John Smith]' to <a href="./family.php?person=1242">John Smith</a>, but what I get is: pgslekt=> select REGEXP_REPLACE('[p=1242|John Smith]', pgslekt(> E'[p=(\d+)|(.+?)]', pgslekt(> E'<a href="./family.php?person=\\1">\\2</a>'); regexp_replace ------------------------------------------------------ [<a href="./family.php?person="></a>=1242|John Smith] (1 row) What am I doing wrong? PostgreSQL 8.2.7 on i686-pc-linux-gnu BTW. -- Leif Biberg Kristensen | Registered Linux User #338009 http://solumslekt.org/ | Cruising with Gentoo/KDE My Jazz Jukebox: http://www.last.fm/user/leifbk/
On Tue, Jun 10, 2008 at 01:28:06PM +0200, Leif B. Kristensen wrote: > I want to transform the text '[p=1242|John Smith]' to > <a href="./family.php?person=1242">John Smith</a>, but what I get is: > > pgslekt=> select REGEXP_REPLACE('[p=1242|John Smith]', > pgslekt(> E'[p=(\d+)|(.+?)]', > pgslekt(> E'<a href="./family.php?person=\\1">\\2</a>'); > regexp_replace > ------------------------------------------------------ > [<a href="./family.php?person="></a>=1242|John Smith] > (1 row) > > What am I doing wrong? Parts of the regular expression need more escaping. Try this: select regexp_replace( '[p=1242|John Smith]', e'\\[p=(\\d+)\\|(.+?)\\]', e'<a href="./family.php?person=\\1">\\2</a>' ); regexp_replace --------------------------------------------------- <a href="./family.php?person=1242">John Smith</a> Caution: this method doesn't do HTML entity escaping so if your input isn't trustworthy then you could end up with HTML that's different from what you intended. -- Michael Fuhr
On Tuesday 10. June 2008, Michael Fuhr wrote: >Parts of the regular expression need more escaping. Try this: > >select regexp_replace( > '[p=1242|John Smith]', > e'\\[p=(\\d+)\\|(.+?)\\]', > e'<a href="./family.php?person=\\1">\\2</a>' >); > > regexp_replace >--------------------------------------------------- > <a href="./family.php?person=1242">John Smith</a> Thank you Michael, I figured it was something fishy with the escaping. When I try your example, I get pgslekt=> select regexp_replace( pgslekt(> '[p=1242|John Smith]', pgslekt(> e'\\[p=(\\d+)\\|(.+?)\\]', pgslekt(> e'<a href="./family.php?person=\\1">\\2</a>' pgslekt(> ); ERROR: syntax error at or near " " LINE 2: '[p=1242|John Smith]', But with my own doctored code, it works just fine: pgslekt=> select REGEXP_REPLACE(E'[p=1242|John Smith]', E'\\[p=(\\d+)\\|(.+?)\\]', E'<a href="./family.php?person=\\1">\\2</a>'); regexp_replace --------------------------------------------------- <a href="./family.php?person=1242">John Smith</a> (1 row) >Caution: this method doesn't do HTML entity escaping so if your >input isn't trustworthy then you could end up with HTML that's >different from what you intended. The input is all my own from 127.0.0.1, so it's of course totally trustworthy :-) -- Leif Biberg Kristensen | Registered Linux User #338009 http://solumslekt.org/ | Cruising with Gentoo/KDE My Jazz Jukebox: http://www.last.fm/user/leifbk/
I put the code into a function, link_expand(): CREATE OR REPLACE FUNCTION link_expand(TEXT) RETURNS TEXT AS $$ SELECT REGEXP_REPLACE($1, E'\\[p=(\\d+)\\|(.+?)\\]', E'<a href="./family.php?person=\\1">\\2</a>', 'g'); $$ LANGUAGE sql STABLE; pgslekt=> select link_expand('[p=123|John Smith]'); link_expand -------------------------------------------------- <a href="./family.php?person=123">John Smith</a> (1 row) So far, so good. But look here: pgslekt=> select link_expand('[p=123|John Smith] and [p=456|Jane Doe]'); link_expand ----------------------------------------------------------------------- <a href="./family.php?person=123">John Smith] and [p=456|Jane Doe</a> (1 row) Hey, I told it not to be greedy, didn't I? -- Leif Biberg Kristensen | Registered Linux User #338009 http://solumslekt.org/ | Cruising with Gentoo/KDE My Jazz Jukebox: http://www.last.fm/user/leifbk/
On Tue, Jun 10, 2008 at 02:25:44PM +0200, Leif B. Kristensen wrote: > Thank you Michael, I figured it was something fishy with the escaping. > When I try your example, I get > > pgslekt=> select regexp_replace( > pgslekt(> '[p=1242|John Smith]', > pgslekt(> e'\\[p=(\\d+)\\|(.+?)\\]', > pgslekt(> e'<a href="./family.php?person=\\1">\\2</a>' > pgslekt(> ); > ERROR: syntax error at or near " " > LINE 2: '[p=1242|John Smith]', Something between my message and your shell appears to have converted a few spaces to no-break spaces. A hex dump of your query shows the following: 00000000 73 65 6c 65 63 74 20 72 65 67 65 78 70 5f 72 65 |select regexp_re| 00000010 70 6c 61 63 65 28 0a c2 a0 20 c2 a0 27 5b 70 3d |place(. '[p=| Notice the byte sequences "c2 a0", which is the UTF-8 encoding of <U+00A0 NO-BREAK SPACE>. Apparently psql doesn't like that. I don't see that sequence in my original message: 00000000 73 65 6c 65 63 74 20 72 65 67 65 78 70 5f 72 65 |select regexp_re| 00000010 70 6c 61 63 65 28 0a 20 20 20 27 5b 70 3d 31 32 |place(. '[p=12| -- Michael Fuhr
On Tue, Jun 10, 2008 at 02:59:53PM +0200, Leif B. Kristensen wrote: > So far, so good. But look here: > > pgslekt=> select link_expand('[p=123|John Smith] and [p=456|Jane Doe]'); > link_expand > ----------------------------------------------------------------------- > <a href="./family.php?person=123">John Smith] and [p=456|Jane Doe</a> > (1 row) > > Hey, I told it not to be greedy, didn't I? Yes, but regexp_replace only replaces that part of the original string that matches the regular expression -- the rest it leaves alone. -- Michael Fuhr
On Tuesday 10. June 2008, Leif B. Kristensen wrote: >Hey, I told it not to be greedy, didn't I? Found it. I must make *both* atoms non-greedy: pgslekt=> CREATE OR REPLACE FUNCTION link_expand(TEXT) RETURNS TEXT AS $$ SELECT REGEXP_REPLACE($1, E'\\[p=(\\d+?)\\|(.+?)\\]', E'<a href="./family.php?person=\\1">\\2</a>', 'g'); $$ LANGUAGE sql STABLE; CREATE FUNCTION pgslekt=> select link_expand('[p=123|John Smith] and [p=456|Jane Doe]'); link_expand ----------------------------------------------------------------------------------------------------- <a href="./family.php?person=123">John Smith</a> and <a href="./family.php?person=456">Jane Doe</a> (1 row) It's not totally intuitive, but at least now it works. -- Leif Biberg Kristensen | Registered Linux User #338009 http://solumslekt.org/ | Cruising with Gentoo/KDE My Jazz Jukebox: http://www.last.fm/user/leifbk/
On Tue, Jun 10, 2008 at 07:41:53AM -0600, Michael Fuhr wrote: > On Tue, Jun 10, 2008 at 02:59:53PM +0200, Leif B. Kristensen wrote: > > So far, so good. But look here: > > > > pgslekt=> select link_expand('[p=123|John Smith] and [p=456|Jane Doe]'); > > link_expand > > ----------------------------------------------------------------------- > > <a href="./family.php?person=123">John Smith] and [p=456|Jane Doe</a> > > (1 row) > > > > Hey, I told it not to be greedy, didn't I? > > Yes, but regexp_replace only replaces that part of the original > string that matches the regular expression -- the rest it leaves > alone. Sorry, this isn't quite right. As you already discovered, the pattern was being more greedy than you wanted. That's one reason why I often use an inverted class instead of assuming that a non-greedy quantifier will grab only what I want. select regexp_replace( '[p=123|John Smith] and [p=456|Jane Doe]', E'\\[p=(\\d+)\\|([^]]+)\\]', E'<a href="./family.php?person=\\1">\\2</a>', 'g' ); regexp_replace ----------------------------------------------------------------------------------------------------- <a href="./family.php?person=123">John Smith</a> and <a href="./family.php?person=456">Jane Doe</a> -- Michael Fuhr
On Tuesday 10. June 2008, CaT wrote: >On Tue, Jun 10, 2008 at 03:43:02PM +0200, Leif B. Kristensen wrote: >> On Tuesday 10. June 2008, Leif B. Kristensen wrote: >> >Hey, I told it not to be greedy, didn't I? >> >> Found it. I must make *both* atoms non-greedy: > >That makes no sense. Take this bit of perl, which works as expected: > >$str = '[p=123|John Smith] and [p=456|Jane Doe]'; > >print "before: '$str'\n"; > >$str =~ s#\[p=(\d+)\|(.+?)\]#1 = $1, 2 = $2 rest of str = #; > >print "after: '$str'\n"; > >There's a bug in your version of pcre I think as postgres would have >little to do with the regex itself (I be guessing). I found this obscure reference in the Pg docs <http://www.postgresql.org/docs/8.2/interactive/functions-matching.html> "In short, when an RE contains both greedy and non-greedy subexpressions, the total match length is either as long as possible or as short as possible, according to the attribute assigned to the whole RE. The attributes assigned to the subexpressions only affect how much of that match they are allowed to "eat" relative to each other." And it was what made me try the above approach. I agree that it doesn't make much sense. -- Leif Biberg Kristensen | Registered Linux User #338009 http://solumslekt.org/ | Cruising with Gentoo/KDE My Jazz Jukebox: http://www.last.fm/user/leifbk/
On Tue, Jun 10, 2008 at 03:43:02PM +0200, Leif B. Kristensen wrote: > On Tuesday 10. June 2008, Leif B. Kristensen wrote: > >Hey, I told it not to be greedy, didn't I? > > Found it. I must make *both* atoms non-greedy: That makes no sense. Take this bit of perl, which works as expected: $str = '[p=123|John Smith] and [p=456|Jane Doe]'; print "before: '$str'\n"; $str =~ s#\[p=(\d+)\|(.+?)\]#1 = $1, 2 = $2 rest of str = #; print "after: '$str'\n"; There's a bug in your version of pcre I think as postgres would have little to do with the regex itself (I be guessing). -- "Police noticed some rustling sounds from Linn's bottom area and on closer inspection a roll of cash was found protruding from Linn's anus, the full amount of cash taken in the robbery." - http://www.smh.com.au/news/world/robber-hides-loot-up-his-booty/2008/05/09/1210131248617.html
CaT <cat@zip.com.au> writes: > There's a bug in your version of pcre I think as postgres would have > little to do with the regex itself (I be guessing). You be wrong ... PG uses Tcl's regex engine, not pcre, and this behavior is as documented. No, I don't know why Henry Spencer chose to do it that way, but he's certainly forgotten more about REs than the rest of us will ever know. regards, tom lane
On Tuesday 10. June 2008, Michael Fuhr wrote: >Something between my message and your shell appears to have converted >a few spaces to no-break spaces. A hex dump of your query shows the >following: > >00000000 73 65 6c 65 63 74 20 72 65 67 65 78 70 5f 72 65 |select > regexp_re| 00000010 70 6c 61 63 65 28 0a c2 a0 20 c2 a0 27 5b 70 3d > |place(. '[p=| > >Notice the byte sequences "c2 a0", which is the UTF-8 encoding of ><U+00A0 NO-BREAK SPACE>. Apparently psql doesn't like that. I don't >see that sequence in my original message: It's probably a KMail bug, or more likely a Qt or KDE library bug. IIRC there was a similar bug in KNode some years ago when i tried to copy and paste some Python code from a news discussion. -- Leif Biberg Kristensen | Registered Linux User #338009 http://solumslekt.org/ | Cruising with Gentoo/KDE My Jazz Jukebox: http://www.last.fm/user/leifbk/
For the record: I've got two different flavors of those "shortlinks". The first one, [p=123|John Smith] is the one that I started this thread with. The second one is just a person number like [p=123] and should be expanded to a similar link, with the default person name (fetched by get_person_name(123)) inserted. Here's my full function that expands both kinds of shortlinks: CREATE OR REPLACE FUNCTION link_expand(TEXT) RETURNS TEXT AS $$ DECLARE str TEXT; tmp TEXT; name TEXT; p INTEGER; BEGIN -- the easy part: replace [p=xxx|yyy] with full link str := REGEXP_REPLACE($1, E'\\[p=(\\d+?)\\|(.+?)\\]', E'<a href="./family.php?person=\\1">\\2</a>', 'g'); -- the hard part: replace [p=xxx] with full link WHILE str SIMILAR TO E'%\\[p=\\d+\\]%' LOOP str := REGEXP_REPLACE(str, E'\\[p=(\\d+?)\\]', E'<a href="./family.php?person=\\1">#\\1#</a>'); tmp := SUBSTRING(str, E'#\\d+?#'); p := BTRIM(tmp, '#')::INTEGER; name := get_person_name(p); str := REPLACE(str, tmp, name); END LOOP; RETURN str; END $$ LANGUAGE plpgsql STABLE; I still think that "the hard part" is a bit ugly, though. -- Leif Biberg Kristensen | Registered Linux User #338009 http://solumslekt.org/ | Cruising with Gentoo/KDE My Jazz Jukebox: http://www.last.fm/user/leifbk/
On Wednesday 11. June 2008, Leif B. Kristensen wrote: > p := BTRIM(tmp, '#')::INTEGER; > name := get_person_name(p); > str := REPLACE(str, tmp, name); I did some "folding" and replaced the above with str := REPLACE(str, tmp, get_person_name(BTRIM(tmp, '#')::INTEGER)); and got rid of two variables. -- Leif Biberg Kristensen | Registered Linux User #338009 http://solumslekt.org/ | Cruising with Gentoo/KDE My Jazz Jukebox: http://www.last.fm/user/leifbk/