Thread: encode, lower and 0x8a
Perhaps my understanding of the 'encode' function is incorrect, but I was under the impression that I could do something like: SELECT lower(encode(bytes, 'escape')) FROM mytable; as it sounded like (from the manual) that 'encode' would return valid ASCII, with all the non-ascii bytes hex escaped. When I have the byte 0x8a, however, I get the error: ERROR: invalid byte sequence for encoding "UTF8": 0x8a I have the sneaking suspicion that I am missing something, so please correct me if I am wrong. If I am wrong, is there a better way to lowercase all the ascii characters in a bytea string? Here is a simple way to recreate this: CREATE TABLE mytable (bytes BYTEA); INSERT INTO mytable VALUES (E'212'); SELECT lower(encode(bytes, 'escape')) FROM mytable; Thanks -Mike
On Thu, Jan 25, 2007 at 02:28:38PM -0500, Michael Artz wrote: > Perhaps my understanding of the 'encode' function is incorrect, but I > was under the impression that I could do something like: > > SELECT lower(encode(bytes, 'escape')) FROM mytable; > > as it sounded like (from the manual) that 'encode' would return valid > ASCII, with all the non-ascii bytes hex escaped. The documentation for encode() does give that impression: "Encode binary string to ASCII-only representation. Supported types are: base64, hex, escape." However, the source code for esc_encode() in src/backend/utils/adt/encode.c says and does otherwise: * Only two characters are escaped: * \0 (null) and \\ (backslash) > When I have the byte 0x8a, however, I get the error: > > ERROR: invalid byte sequence for encoding "UTF8": 0x8a Since encode() returns text and doesn't escape non-ASCII characters, all of the original binary data will be treated as though it's text in the database's encoding. If the data contains byte sequences that aren't valid in that encoding then you get the above error. > I have the sneaking suspicion that I am missing something, so please > correct me if I am wrong. If I am wrong, is there a better way to > lowercase all the ascii characters in a bytea string? What are you trying to do? What is the binary data and why are you treating it (or part of it) as though it's text? Do you want the end result to be text with escape sequences or do you want to convert it back to bytea? Something like this might work: SELECT lower(textin(byteaout(bytes))) FROM mytable; To turn the result back into bytea: SELECT decode(lower(textin(byteaout(bytes))), 'escape') FROM mytable; -- Michael Fuhr
On 1/27/07, Michael Fuhr <mike@fuhr.org> wrote: > On Thu, Jan 25, 2007 at 02:28:38PM -0500, Michael Artz wrote: > > Perhaps my understanding of the 'encode' function is incorrect, but I > > was under the impression that I could do something like: > > > > SELECT lower(encode(bytes, 'escape')) FROM mytable; > > > > as it sounded like (from the manual) that 'encode' would return valid > > ASCII, with all the non-ascii bytes hex escaped. ...snip... > > I have the sneaking suspicion that I am missing something, so please > > correct me if I am wrong. If I am wrong, is there a better way to > > lowercase all the ascii characters in a bytea string? > > What are you trying to do? What is the binary data and why are you > treating it (or part of it) as though it's text? Do you want the > end result to be text with escape sequences or do you want to convert > it back to bytea? The data are binary strings culled from network traffic. Most of the "binary strings" are just ascii strings, however not all of them are, and some of the ascii strings have binary characters embedded in them. In this case, I was displaying the string to a user via a web browser, in which case I wanted the output to be escaped. The reason for the 'lower()' was to enable the case-insensitive sort, so actually the offending line of the program would look something more like: SELECT encode(bytes, 'escape') FROM mytables ORDER BY lower(encode(bytes, 'escape')) > Something like this might work: > > SELECT lower(textin(byteaout(bytes))) FROM mytable; > > To turn the result back into bytea: > > SELECT decode(lower(textin(byteaout(bytes))), 'escape') FROM mytable; That seems to work correctly, however I missed the functions textin' and 'byteaout' in the docs ... are they documented somewhere? Thanks, -Mike
On Mon, Jan 29, 2007 at 12:52:33PM -0500, Michael Artz wrote: > On 1/27/07, Michael Fuhr <mike@fuhr.org> wrote: > >SELECT decode(lower(textin(byteaout(bytes))), 'escape') FROM mytable; > > That seems to work correctly, however I missed the functions textin' > and 'byteaout' in the docs ... are they documented somewhere? Those are the types' input/output functions. They're not specifically documented but "User-Defined Types" in the "Extending SQL" chapter discusses the concept; see also CREATE TYPE. http://www.postgresql.org/docs/8.2/interactive/xtypes.html http://www.postgresql.org/docs/8.2/interactive/sql-createtype.html Using the input/output functions is a bit hackish since, as you discovered, their exact behavior isn't documented. -- Michael Fuhr
I have updated the encode() documentation to not mention "ASCII", and to be more specific about what 'escape' does. Backpatched to 8.2.X. --------------------------------------------------------------------------- Michael Fuhr wrote: > On Thu, Jan 25, 2007 at 02:28:38PM -0500, Michael Artz wrote: > > Perhaps my understanding of the 'encode' function is incorrect, but I > > was under the impression that I could do something like: > > > > SELECT lower(encode(bytes, 'escape')) FROM mytable; > > > > as it sounded like (from the manual) that 'encode' would return valid > > ASCII, with all the non-ascii bytes hex escaped. > > The documentation for encode() does give that impression: "Encode > binary string to ASCII-only representation. Supported types are: > base64, hex, escape." However, the source code for esc_encode() > in src/backend/utils/adt/encode.c says and does otherwise: > > * Only two characters are escaped: > * \0 (null) and \\ (backslash) > > > When I have the byte 0x8a, however, I get the error: > > > > ERROR: invalid byte sequence for encoding "UTF8": 0x8a > > Since encode() returns text and doesn't escape non-ASCII characters, > all of the original binary data will be treated as though it's text > in the database's encoding. If the data contains byte sequences > that aren't valid in that encoding then you get the above error. > > > I have the sneaking suspicion that I am missing something, so please > > correct me if I am wrong. If I am wrong, is there a better way to > > lowercase all the ascii characters in a bytea string? > > What are you trying to do? What is the binary data and why are you > treating it (or part of it) as though it's text? Do you want the > end result to be text with escape sequences or do you want to convert > it back to bytea? > > Something like this might work: > > SELECT lower(textin(byteaout(bytes))) FROM mytable; > > To turn the result back into bytea: > > SELECT decode(lower(textin(byteaout(bytes))), 'escape') FROM mytable; > > -- > Michael Fuhr > > ---------------------------(end of broadcast)--------------------------- > TIP 3: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faq -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + Index: doc/src/sgml/func.sgml =================================================================== RCS file: /cvsroot/pgsql/doc/src/sgml/func.sgml,v retrieving revision 1.368 diff -c -c -r1.368 func.sgml *** doc/src/sgml/func.sgml 20 Feb 2007 18:18:05 -0000 1.368 --- doc/src/sgml/func.sgml 20 Feb 2007 19:48:37 -0000 *************** *** 1356,1363 **** </entry> <entry><type>text</type></entry> <entry> ! Encode binary data to <acronym>ASCII</acronym>-only representation. Supported types are: <literal>base64</>, <literal>hex</>, <literal>escape</>. </entry> <entry><literal>encode( E'123\\000\\001', 'base64')</literal></entry> <entry><literal>MTIzAAE=</literal></entry> --- 1356,1365 ---- </entry> <entry><type>text</type></entry> <entry> ! Encode binary data to different representation. Supported types are: <literal>base64</>, <literal>hex</>, <literal>escape</>. + <literal>Escape</> merely outputs null bytes as <literal>\000</> and + doubles backslashes. </entry> <entry><literal>encode( E'123\\000\\001', 'base64')</literal></entry> <entry><literal>MTIzAAE=</literal></entry>