Thread: REGEXP_REPLACE woes

REGEXP_REPLACE woes

From

"Leif B. Kristensen"

Date:

10 June 2008, 08:28:13

I want to transform the text '[p=1242|John Smith]' to
<a href="./family.php?person=1242">John Smith</a>, but what I get is:

pgslekt=> select REGEXP_REPLACE('[p=1242|John Smith]',
pgslekt(> E'[p=(\d+)|(.+?)]',
pgslekt(> E'<a href="./family.php?person=\\1">\\2</a>');
                    regexp_replace
------------------------------------------------------
 [<a href="./family.php?person="></a>=1242|John Smith]
(1 row)

What am I doing wrong?

PostgreSQL 8.2.7 on i686-pc-linux-gnu BTW.
--
Leif Biberg Kristensen | Registered Linux User #338009
http://solumslekt.org/ | Cruising with Gentoo/KDE
My Jazz Jukebox: http://www.last.fm/user/leifbk/

Re: REGEXP_REPLACE woes

From

Michael Fuhr

Date:

10 June 2008, 09:22:32

On Tue, Jun 10, 2008 at 01:28:06PM +0200, Leif B. Kristensen wrote:
> I want to transform the text '[p=1242|John Smith]' to
> <a href="./family.php?person=1242">John Smith</a>, but what I get is:
>
> pgslekt=> select REGEXP_REPLACE('[p=1242|John Smith]',
> pgslekt(> E'[p=(\d+)|(.+?)]',
> pgslekt(> E'<a href="./family.php?person=\\1">\\2</a>');
>                     regexp_replace
> ------------------------------------------------------
>  [<a href="./family.php?person="></a>=1242|John Smith]
> (1 row)
>
> What am I doing wrong?

Parts of the regular expression need more escaping.  Try this:

select regexp_replace(
   '[p=1242|John Smith]',
  e'\\[p=(\\d+)\\|(.+?)\\]',
  e'<a href="./family.php?person=\\1">\\2</a>'
);

                  regexp_replace
---------------------------------------------------
 <a href="./family.php?person=1242">John Smith</a>

Caution: this method doesn't do HTML entity escaping so if your
input isn't trustworthy then you could end up with HTML that's
different from what you intended.

--
Michael Fuhr

Re: REGEXP_REPLACE woes

From

"Leif B. Kristensen"

Date:

10 June 2008, 09:26:22

On Tuesday 10. June 2008, Michael Fuhr wrote:
>Parts of the regular expression need more escaping.  Try this:
>
>select regexp_replace(
>   '[p=1242|John Smith]',
>  e'\\[p=(\\d+)\\|(.+?)\\]',
>  e'<a href="./family.php?person=\\1">\\2</a>'
>);
>
>                  regexp_replace
>---------------------------------------------------
> <a href="./family.php?person=1242">John Smith</a>

Thank you Michael, I figured it was something fishy with the escaping.
When I try your example, I get

pgslekt=> select regexp_replace(
pgslekt(>    '[p=1242|John Smith]',
pgslekt(>   e'\\[p=(\\d+)\\|(.+?)\\]',
pgslekt(>   e'<a href="./family.php?person=\\1">\\2</a>'
pgslekt(> );
ERROR:  syntax error at or near " "
LINE 2:    '[p=1242|John Smith]',

But with my own doctored code, it works just fine:

pgslekt=> select REGEXP_REPLACE(E'[p=1242|John Smith]',
E'\\[p=(\\d+)\\|(.+?)\\]',
E'<a href="./family.php?person=\\1">\\2</a>');
                  regexp_replace
---------------------------------------------------
 <a href="./family.php?person=1242">John Smith</a>
(1 row)

>Caution: this method doesn't do HTML entity escaping so if your
>input isn't trustworthy then you could end up with HTML that's
>different from what you intended.

The input is all my own from 127.0.0.1, so it's of course totally
trustworthy :-)
--
Leif Biberg Kristensen | Registered Linux User #338009
http://solumslekt.org/ | Cruising with Gentoo/KDE
My Jazz Jukebox: http://www.last.fm/user/leifbk/

Re: REGEXP_REPLACE woes

From

"Leif B. Kristensen"

Date:

10 June 2008, 10:00:59

I put the code into a function, link_expand():

CREATE OR REPLACE FUNCTION link_expand(TEXT) RETURNS TEXT AS $$
SELECT REGEXP_REPLACE($1,
             E'\\[p=(\\d+)\\|(.+?)\\]',
             E'<a href="./family.php?person=\\1">\\2</a>', 'g');
$$ LANGUAGE sql STABLE;

pgslekt=> select link_expand('[p=123|John Smith]');
                   link_expand
--------------------------------------------------
 <a href="./family.php?person=123">John Smith</a>
(1 row)

So far, so good. But look here:

pgslekt=> select link_expand('[p=123|John Smith] and [p=456|Jane Doe]');
                              link_expand
-----------------------------------------------------------------------
 <a href="./family.php?person=123">John Smith] and [p=456|Jane Doe</a>
(1 row)

Hey, I told it not to be greedy, didn't I?
--
Leif Biberg Kristensen | Registered Linux User #338009
http://solumslekt.org/ | Cruising with Gentoo/KDE
My Jazz Jukebox: http://www.last.fm/user/leifbk/

Re: REGEXP_REPLACE woes

From

Michael Fuhr

Date:

10 June 2008, 10:39:34

On Tue, Jun 10, 2008 at 02:25:44PM +0200, Leif B. Kristensen wrote:
> Thank you Michael, I figured it was something fishy with the escaping.
> When I try your example, I get
>
> pgslekt=> select regexp_replace(
> pgslekt(>    '[p=1242|John Smith]',
> pgslekt(>   e'\\[p=(\\d+)\\|(.+?)\\]',
> pgslekt(>   e'<a href="./family.php?person=\\1">\\2</a>'
> pgslekt(> );
> ERROR:  syntax error at or near " "
> LINE 2:    '[p=1242|John Smith]',

Something between my message and your shell appears to have converted
a few spaces to no-break spaces.  A hex dump of your query shows the
following:

00000000  73 65 6c 65 63 74 20 72  65 67 65 78 70 5f 72 65  |select regexp_re|
00000010  70 6c 61 63 65 28 0a c2  a0 20 c2 a0 27 5b 70 3d  |place(.   '[p=|

Notice the byte sequences "c2 a0", which is the UTF-8 encoding of
<U+00A0 NO-BREAK SPACE>.  Apparently psql doesn't like that.  I don't
see that sequence in my original message:

00000000  73 65 6c 65 63 74 20 72  65 67 65 78 70 5f 72 65  |select regexp_re|
00000010  70 6c 61 63 65 28 0a 20  20 20 27 5b 70 3d 31 32  |place(.   '[p=12|

--
Michael Fuhr

Re: REGEXP_REPLACE woes

From

Michael Fuhr

Date:

10 June 2008, 10:42:15

On Tue, Jun 10, 2008 at 02:59:53PM +0200, Leif B. Kristensen wrote:
> So far, so good. But look here:
>
> pgslekt=> select link_expand('[p=123|John Smith] and [p=456|Jane Doe]');
>                               link_expand
> -----------------------------------------------------------------------
>  <a href="./family.php?person=123">John Smith] and [p=456|Jane Doe</a>
> (1 row)
>
> Hey, I told it not to be greedy, didn't I?

Yes, but regexp_replace only replaces that part of the original
string that matches the regular expression -- the rest it leaves
alone.

--
Michael Fuhr

Re: REGEXP_REPLACE woes

From

"Leif B. Kristensen"

Date:

10 June 2008, 10:44:09

On Tuesday 10. June 2008, Leif B. Kristensen wrote:
>Hey, I told it not to be greedy, didn't I?

Found it. I must make *both* atoms non-greedy:

pgslekt=> CREATE OR REPLACE FUNCTION link_expand(TEXT) RETURNS TEXT AS
$$
SELECT REGEXP_REPLACE($1,
            E'\\[p=(\\d+?)\\|(.+?)\\]',
            E'<a href="./family.php?person=\\1">\\2</a>', 'g');
$$ LANGUAGE sql STABLE;
CREATE FUNCTION
pgslekt=> select link_expand('[p=123|John Smith] and [p=456|Jane Doe]');
                                             link_expand
-----------------------------------------------------------------------------------------------------
 <a href="./family.php?person=123">John Smith</a> and <a
href="./family.php?person=456">Jane Doe</a>
(1 row)

It's not totally intuitive, but at least now it works.
--
Leif Biberg Kristensen | Registered Linux User #338009
http://solumslekt.org/ | Cruising with Gentoo/KDE
My Jazz Jukebox: http://www.last.fm/user/leifbk/

Re: REGEXP_REPLACE woes

From

Michael Fuhr

Date:

10 June 2008, 11:07:18

On Tue, Jun 10, 2008 at 07:41:53AM -0600, Michael Fuhr wrote:
> On Tue, Jun 10, 2008 at 02:59:53PM +0200, Leif B. Kristensen wrote:
> > So far, so good. But look here:
> >
> > pgslekt=> select link_expand('[p=123|John Smith] and [p=456|Jane Doe]');
> >                               link_expand
> > -----------------------------------------------------------------------
> >  <a href="./family.php?person=123">John Smith] and [p=456|Jane Doe</a>
> > (1 row)
> >
> > Hey, I told it not to be greedy, didn't I?
>
> Yes, but regexp_replace only replaces that part of the original
> string that matches the regular expression -- the rest it leaves
> alone.

Sorry, this isn't quite right.  As you already discovered, the
pattern was being more greedy than you wanted.  That's one reason
why I often use an inverted class instead of assuming that a
non-greedy quantifier will grab only what I want.

select regexp_replace(
  '[p=123|John Smith] and [p=456|Jane Doe]',
  E'\\[p=(\\d+)\\|([^]]+)\\]',
  E'<a href="./family.php?person=\\1">\\2</a>',
  'g'
);

                                           regexp_replace
-----------------------------------------------------------------------------------------------------
 <a href="./family.php?person=123">John Smith</a> and <a href="./family.php?person=456">Jane Doe</a>

--
Michael Fuhr

Re: REGEXP_REPLACE woes

From

"Leif B. Kristensen"

Date:

10 June 2008, 11:11:11

On Tuesday 10. June 2008, CaT wrote:
>On Tue, Jun 10, 2008 at 03:43:02PM +0200, Leif B. Kristensen wrote:
>> On Tuesday 10. June 2008, Leif B. Kristensen wrote:
>> >Hey, I told it not to be greedy, didn't I?
>>
>> Found it. I must make *both* atoms non-greedy:
>
>That makes no sense. Take this bit of perl, which works as expected:
>
>$str = '[p=123|John Smith] and [p=456|Jane Doe]';
>
>print "before: '$str'\n";
>
>$str =~ s#\[p=(\d+)\|(.+?)\]#1 = $1, 2 = $2 rest of str = #;
>
>print "after: '$str'\n";
>
>There's a bug in your version of pcre I think as postgres would have
>little to do with the regex itself (I be guessing).

I found this obscure reference in the Pg docs
<http://www.postgresql.org/docs/8.2/interactive/functions-matching.html>

"In short, when an RE contains both greedy and non-greedy
subexpressions, the total match length is either as long as possible or
as short as possible, according to the attribute assigned to the whole
RE. The attributes assigned to the subexpressions only affect how much
of that match they are allowed to "eat" relative to each other."

And it was what made me try the above approach. I agree that it doesn't
make much sense.
--
Leif Biberg Kristensen | Registered Linux User #338009
http://solumslekt.org/ | Cruising with Gentoo/KDE
My Jazz Jukebox: http://www.last.fm/user/leifbk/

Re: REGEXP_REPLACE woes

From

CaT

Date:

10 June 2008, 11:20:21

On Tue, Jun 10, 2008 at 03:43:02PM +0200, Leif B. Kristensen wrote:
> On Tuesday 10. June 2008, Leif B. Kristensen wrote:
> >Hey, I told it not to be greedy, didn't I?
>
> Found it. I must make *both* atoms non-greedy:

That makes no sense. Take this bit of perl, which works as expected:

$str = '[p=123|John Smith] and [p=456|Jane Doe]';

print "before: '$str'\n";

$str =~ s#\[p=(\d+)\|(.+?)\]#1 = $1, 2 = $2 rest of str = #;

print "after: '$str'\n";

There's a bug in your version of pcre I think as postgres would have
little to do with the regex itself (I be guessing).

--
  "Police noticed some rustling sounds from Linn's bottom area
  and on closer inspection a roll of cash was found protruding
  from Linn's anus, the full amount of cash taken in the robbery."
    - http://www.smh.com.au/news/world/robber-hides-loot-up-his-booty/2008/05/09/1210131248617.html

Re: REGEXP_REPLACE woes

From

Tom Lane

Date:

10 June 2008, 11:54:11

CaT <cat@zip.com.au> writes:
> There's a bug in your version of pcre I think as postgres would have
> little to do with the regex itself (I be guessing).

You be wrong ... PG uses Tcl's regex engine, not pcre, and this behavior
is as documented.  No, I don't know why Henry Spencer chose to do it
that way, but he's certainly forgotten more about REs than the rest of
us will ever know.

            regards, tom lane

Re: REGEXP_REPLACE woes

From

"Leif B. Kristensen"

Date:

10 June 2008, 14:09:47

On Tuesday 10. June 2008, Michael Fuhr wrote:

>Something between my message and your shell appears to have converted
>a few spaces to no-break spaces.  A hex dump of your query shows the
>following:
>
>00000000  73 65 6c 65 63 74 20 72  65 67 65 78 70 5f 72 65  |select
> regexp_re| 00000010  70 6c 61 63 65 28 0a c2  a0 20 c2 a0 27 5b 70 3d
>  |place(.   '[p=|
>
>Notice the byte sequences "c2 a0", which is the UTF-8 encoding of
><U+00A0 NO-BREAK SPACE>.  Apparently psql doesn't like that.  I don't
>see that sequence in my original message:

It's probably a KMail bug, or more likely a Qt or KDE library bug. IIRC
there was a similar bug in KNode some years ago when i tried to copy
and paste some Python code from a news discussion.
--
Leif Biberg Kristensen | Registered Linux User #338009
http://solumslekt.org/ | Cruising with Gentoo/KDE
My Jazz Jukebox: http://www.last.fm/user/leifbk/

Re: REGEXP_REPLACE woes

From

"Leif B. Kristensen"

Date:

11 June 2008, 06:19:22

For the record: I've got two different flavors of those "shortlinks".
The first one, [p=123|John Smith] is the one that I started this thread
with. The second one is just a person number like [p=123] and should be
expanded to a similar link, with the default person name (fetched by
get_person_name(123)) inserted.

Here's my full function that expands both kinds of shortlinks:

CREATE OR REPLACE FUNCTION link_expand(TEXT) RETURNS TEXT AS $$
DECLARE
    str TEXT;
    tmp TEXT;
    name TEXT;
    p INTEGER;
BEGIN
    -- the easy part: replace [p=xxx|yyy] with full link
    str := REGEXP_REPLACE($1,
            E'\\[p=(\\d+?)\\|(.+?)\\]',
            E'<a href="./family.php?person=\\1">\\2</a>', 'g');
    -- the hard part: replace [p=xxx] with full link
    WHILE str SIMILAR TO E'%\\[p=\\d+\\]%' LOOP
        str := REGEXP_REPLACE(str,
            E'\\[p=(\\d+?)\\]',
            E'<a href="./family.php?person=\\1">#\\1#</a>');
        tmp := SUBSTRING(str, E'#\\d+?#');
        p := BTRIM(tmp, '#')::INTEGER;
        name := get_person_name(p);
        str := REPLACE(str, tmp, name);
    END LOOP;
    RETURN str;
END
$$ LANGUAGE plpgsql STABLE;

I still think that "the hard part" is a bit ugly, though.
--
Leif Biberg Kristensen | Registered Linux User #338009
http://solumslekt.org/ | Cruising with Gentoo/KDE
My Jazz Jukebox: http://www.last.fm/user/leifbk/

Re: REGEXP_REPLACE woes

From

"Leif B. Kristensen"

Date:

11 June 2008, 07:24:22

On Wednesday 11. June 2008, Leif B. Kristensen wrote:

>        p := BTRIM(tmp, '#')::INTEGER;
>        name := get_person_name(p);
>        str := REPLACE(str, tmp, name);

I did some "folding" and replaced the above with

str := REPLACE(str, tmp, get_person_name(BTRIM(tmp, '#')::INTEGER));

and got rid of two variables.
--
Leif Biberg Kristensen | Registered Linux User #338009
http://solumslekt.org/ | Cruising with Gentoo/KDE
My Jazz Jukebox: http://www.last.fm/user/leifbk/