Re: XML with invalid chars - Mailing list pgsql-hackers
From | Andrew Dunstan |
---|---|
Subject | Re: XML with invalid chars |
Date | |
Msg-id | 4DCB0AE3.5020503@dunslane.net Whole thread Raw |
In response to | Re: XML with invalid chars (Noah Misch <noah@leadboat.com>) |
Responses |
Re: XML with invalid chars
|
List | pgsql-hackers |
On 05/09/2011 11:25 PM, Noah Misch wrote: > > I see you've gone with doing it unconditionally. I'd lean toward testing the > library in pg_xml_init and setting a flag indicating whether we need the extra > pass. However, a later patch can always optimize that. > I wasn't terribly keen on the idea, but we can look at it again later. >> >> Please review and try to break. > Here are the test cases I tried: > > -- caught successfully > SELECT E'\x01'::xml; > SELECT xmlcomment(E'\x01'); > SELECT xmlelement(name foo, xmlattributes(E'\x01' AS bar), ''); > SELECT xmlelement(name foo, NULL, E'\x01'); > SELECT xmlforest(E'\x01' AS foo); > SELECT xmlpi(name foo, E'\x01'); > SELECT query_to_xml($$SELECT E'\x01'$$, true, false, ''); > > -- not caught > SELECT xmlroot('<root/>', version E'\x01'); That's an easy fix. > SELECT xmlcomment(E'\ufffe'); That's a bit harder. Do we want to extend these checks to cover surrogates and end of plane characters, which are the remaining forbidden chars? It certainly seems likely to be a somewhat slower test since I think we'd need to process the input strings a Unicode char at a time, but we do that in other places and it seems acceptable. What do people think? > -- not directly related, but also wrongly accepted > SELECT xmlroot('<root/>', version ' '); > SELECT xmlroot('<root/>', version 'foo'); > > Offhand, I don't find libxml2's handling of XML declarations particularly > consistent. My copy's xmlCtxtReadDoc() API (used by xml_in when xmloption = > document) accepts '<?xml version="foo"?>' but rejects'<?xml version=" "?>'. > Its xmlParseBalancedChunkMemory() API (used by xml_in when xmloption = content) > accepts anything, even control characters. The XML 1.0 standard is stricter: > the version must match ^1\.[0-9]+$. We might want to tighten this at the same > time. We can add some stuff to check the version strings. Doesn't seem terribly difficult. > libxml2's error message for this case is "PCDATA invalid Char value 1" > (assuming \x01). Mentioning PCDATA seems redundant, since no other context > offers greater freedom. How about: > > ereport(ERROR, > (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), > errmsg("invalid XML 1.0 Char \\U%08x", char_val))); > > That would also mean processing the string a unicode char at a time. So maybe that's what we need to do. Thanks for the input. cheers andrew
pgsql-hackers by date: