Re: XML/XPath issues: text/CDATA in XMLTABLE, XPath evaluated withwrong context - Mailing list pgsql-hackers
From | Alvaro Herrera |
---|---|
Subject | Re: XML/XPath issues: text/CDATA in XMLTABLE, XPath evaluated withwrong context |
Date | |
Msg-id | 20180426191828.4wqpki2slu2ec6n7@alvherre.pgsql Whole thread Raw |
In response to | XML/XPath issues: text/CDATA in XMLTABLE, XPath evaluated with wrong context (Markus Winand <markus.winand@winand.at>) |
Responses |
Re: XML/XPath issues: text/CDATA in XMLTABLE, XPath evaluated with wrong context
|
List | pgsql-hackers |
Hello Markus, Markus Winand wrote: > * Patch 0001: Accept TEXT and CDATA nodes in XMLTABLE’s column_expression. > > > Column_expressions that match TEXT or CDATA nodes must return > > the contents of the node itself, not the content of the > > non-existing childs (i.e. the empty string). > > The following query returns a wrong result in master: > > SELECT * > FROM (VALUES ('<xml>text</xml>'::xml) > , ('<xml><![CDATA[<cdata>]]></xml>'::xml) > ) d(x) > CROSS JOIN LATERAL > XMLTABLE('/xml' > PASSING x > COLUMNS "node()" TEXT PATH 'node()' > ) t > The correct result can be obtained with patch 0001 applied: > > x | node() > --------------------------------+--------- > <xml>text</xml> | text > <xml><![CDATA[<cdata>]]></xml> | <cdata> > > The patch adopts existing tests to guard against future regressions. I remembered while reading this that Craig Ringer had commented that XML text sections can have multiple children: just put XML comments amidst the text. To test this, I added a comment in one of the text values, in the regression database: update xmldata set data = regexp_replace(data::text, 'HongKong', 'Hong<!-- a space -->Kong')::xml; With the modified data, the new query in the regression tests fails: SELECT xmltable.* FROM (SELECT data FROM xmldata) x, LATERAL XMLTABLE('/ROWS/ROW' PASSING data COLUMNS id int PATH '@id', _id FOR ORDINALITY, country_name text PATH 'COUNTRY_NAME/text()' NOT NULL, country_id text PATH 'COUNTRY_ID', region_id int PATH 'REGION_ID', size float PATH 'SIZE', unit text PATH 'SIZE/@unit', premier_name text PATH 'PREMIER_NAME' DEFAULT 'not specified'); ERROR: more than one value returned by column XPath expression This seems really undesirable, so I looked into getting it fixed. Turns out that this error is being thrown from inside the same function we're modifying, line 4559. I didn't have a terribly high opinion of that ereport() when working on this feature to begin with, so now that it's proven to provide a bad experience, let's try removing it. With that change (patch attached), the feature seems to work; I tried with this query, which seems to give the expected output (notice we have some columns of type xml, others of type text, with and without the text() function in the path): SELECT xmltable.* FROM (SELECT data FROM xmldata) x, LATERAL XMLTABLE('/ROWS/ROW' PASSING data COLUMNS country_name text PATH 'COUNTRY_NAME/text()' NOT NULL, country_name_xml xml PATH 'COUNTRY_NAME/text()' NOT NULL, country_name_n text PATH 'COUNTRY_NAME' NOT NULL, country_name_xml_n xml PATH 'COUNTRY_NAME' NOT NULL); country_name │ country_name_xml │ country_name_n │ country_name_xml_n ────────────────┼──────────────────┼────────────────┼─────────────────────────────────────────────────────── Australia │ Australia │ Australia │ <COUNTRY_NAME>Australia</COUNTRY_NAME> China │ China │ China │ <COUNTRY_NAME>China</COUNTRY_NAME> HongKong │ HongKong │ HongKong │ <COUNTRY_NAME>Hong<!-- a space -->Kong</COUNTRY_NAME> India │ India │ India │ <COUNTRY_NAME>India</COUNTRY_NAME> Japan │ Japan │ Japan │ <COUNTRY_NAME>Japan</COUNTRY_NAME> Singapore │ Singapore │ Singapore │ <COUNTRY_NAME>Singapore</COUNTRY_NAME> Czech Republic │ Czech Republic │ Czech Republic │ <COUNTRY_NAME>Czech Republic</COUNTRY_NAME> Germany │ Germany │ Germany │ <COUNTRY_NAME>Germany</COUNTRY_NAME> France │ France │ France │ <COUNTRY_NAME>France</COUNTRY_NAME> Egypt │ Egypt │ Egypt │ <COUNTRY_NAME>Egypt</COUNTRY_NAME> Sudan │ Sudan │ Sudan │ <COUNTRY_NAME>Sudan</COUNTRY_NAME> (11 filas) This means that the concatenation now works for all types, not just xml, so I can do this also: update xmldata set data = regexp_replace(data::text, '791', '7<!--small country-->91')::xml; SELECT xmltable.* FROM (SELECT data FROM xmldata) x, LATERAL XMLTABLE('/ROWS/ROW' PASSING data COLUMNS country_name text PATH 'COUNTRY_NAME/text()' NOT NULL, size_text float PATH 'SIZE/text()', "SIZE" float, size_xml xml PATH 'SIZE'); country_name │ size_text │ SIZE ────────────────┼───────────┼────── Australia │ │ China │ │ HongKong │ │ India │ │ Japan │ │ Singapore │ 791 │ 791 Czech Republic │ │ Germany │ │ France │ │ Egypt │ │ Sudan │ │ (11 filas) I'm not sure if this concatenation of things that are not text or XML is undesirable for whatever reason. Any clues? Also, array indexes behave funny. First let's add more XML comments inside that number, and query for the subscripts: update xmldata set data = regexp_replace(data::text, '7<!--small country-->91', '<!--ah-->7<!--oh-->9<!--uh-->1')::xml; SELECT xmltable.* FROM (SELECT data FROM xmldata) x, LATERAL XMLTABLE('/ROWS/ROW' PASSING data COLUMNS country_name text PATH 'COUNTRY_NAME/text()' NOT NULL, size_text float PATH 'SIZE/text()', size_text_1 float PATH 'SIZE/text()[1]', size_text_2 float PATH 'SIZE/text()[2]', "SIZE" float, size_xml xml PATH 'SIZE') where size_text is not null; country_name │ size_text │ size_text_1 │ size_text_2 │ size_text_3 │ SIZE │ size_xml ──────────────┼───────────┼─────────────┼─────────────┼─────────────┼──────┼─────────────────────────────────────────────────────── Singapore │ 791 │ 791 │ 91 │ 1 │ 791 │ <SIZE unit="km"><!--ah-->7<!--oh-->9<!--uh-->1</SIZE> (1 fila) I don't know what to make of that. Seems pretty broken. After this, I looked for some examples of XPath syntax in the interwebs. I came across the | operator (which apparently is a "union" of some sort). Behold: SELECT xmltable.* FROM (SELECT data FROM xmldata) x, LATERAL XMLTABLE('/ROWS/ROW' PASSING data COLUMNS country_name text PATH 'COUNTRY_NAME/text()' NOT NULL, size_text text PATH 'SIZE/text() | SIZE/@unit') where size_text is not null ; country_name │ size_text ──────────────┼─────────── Singapore │ km791 (1 fila) The "units" property is ahead of the text(), which is pretty odd. But if I remove the text() call, it puts the units after the text: SELECT xmltable.* FROM (SELECT data FROM xmldata) x, LATERAL XMLTABLE('/ROWS/ROW' PASSING data COLUMNS country_name text PATH 'COUNTRY_NAME/text()' NOT NULL, size_text text PATH 'SIZE | SIZE/@unit') where size_text is not null ; country_name │ size_text ──────────────┼───────────────────────────────────────────────────────── Singapore │ <SIZE unit="km"><!--ah-->7<!--oh-->9<!--uh-->1</SIZE>km (1 fila) Again, I don't know what to make of this. Anyway, this seems firmly in the libxml side of things; the only conclusion I have is that nobody ever uses libxml terribly much for complex XPath processing. Basing another test on your original test case, look at the first row here. Is this result correct? SELECT * FROM (VALUES ('<xml>te<!-- ahoy -->xt</xml>'::xml) , ('<xml><![CDATA[some <!-- really --> weird <stuff>]]></xml>'::xml) ) d(x) CROSS JOIN LATERAL XMLTABLE('/xml' PASSING x COLUMNS "node()" TEXT PATH 'node()' ) t; x │ node() ───────────────────────────────────────────────────────────┼──────────────────────────────────── <xml>te<!-- ahoy -->xt</xml> │ te ahoy xt <xml><![CDATA[some <!-- really --> weird <stuff>]]></xml> │ some <!-- really --> weird <stuff> Why was the comment contents expanded inside node()? Note what happens if I change the type from text to xml in that column: SELECT * FROM (VALUES ('<xml>te<!-- ahoy -->xt</xml>'::xml) , ('<xml><![CDATA[some <!-- really --> weird <stuff>]]></xml>'::xml) ) d(x) CROSS JOIN LATERAL XMLTABLE('/xml' PASSING x COLUMNS "node()" xml PATH 'node()' ) t; x │ node() ───────────────────────────────────────────────────────────┼──────────────────────────────────────────────── <xml>te<!-- ahoy -->xt</xml> │ te ahoy xt <xml><![CDATA[some <!-- really --> weird <stuff>]]></xml> │ some <!-- really --> weird <stuff> (2 filas) Further note that, per the standard, you can omit the PATH clause in which case the column name is used as the path, which seems to work correctly. Phew! > * Patch 0002: Set correct context for XPath evaluation. > > > According to the ISO standard, the context of XMLTABLE's XPath > > row_expression is the document node of the XML[1] - not the root node. > > > > The respective code is shared with non-ISO functions such as xpath > > and xpath_exists so that the change also affects these functions. > > It’s an incompatible change - even the regression tests need to be adjusted because they > test for the “wrong” behaviour. I'm really afraid to change the non-ISO functions in PG10, since it's already released for quite some time with this long-standing behavior; and I'm not sure it'll do much good to change XMLTABLE in PG10 and leave the non-iso functions unpatched. I think the easiest route is to patch all functions only for PostgreSQL 11. Maybe if XMLTABLE is considered unacceptably broken in PostgreSQL 10 we can patch only that one. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
pgsql-hackers by date: