Thread: I/O support for composite types
There's just one thing left to do to make composite types useful as table columns: we have to support I/O of composite values. (Without this, pg_dump would fail to work on such columns, rendering them not very useful in the real world.) This means we have to hammer out a definition for what the external representation is. Here are my thoughts on the subject.

Textual representation: I am inclined to define this similarly to the representation for arrays; however, we need to allow for NULLs. I suggest

    {item,item,item}

The separator is always comma (it can't be type-specific since the items might have different types). Backslashes and double quotes can be used in the usual ways to quote characters in the item strings. If an item string is completely empty it is taken as NULL; to write an actual empty-string value, you must write "". There is an ambiguity whether '{}' represents a zero-column row or a one-column row containing a NULL, but I don't think this is a problem since the input converter will always know how many columns it is expecting.

There are a couple of fine points of the array I/O behavior that I think we should not emulate. One is that leading whitespace in an item string is discarded. This seems inconsistent, mainly because trailing whitespace isn't discarded. In the cases where it really makes sense to discard whitespace (namely numeric datatypes), the underlying datatype's input converter can do that just fine, and so I suggest that the record converter itself should not discard whitespace. It seems OK to ignore whitespace before and after the outer braces, however.

The other fine point has to do with double quoting. In the array code,

    {a"b""c"d}

is legal input representing an item 'abcd'. I think it would be more consistent with usual SQL conventions to treat it as meaning 'ab"cd', that is, a doubled double quote within double quotes should represent a double quote, not nothing. Anyone have a strong feeling one way or the other? (In the long run we might want to think about making these same changes in array_in, but that's a can of worms I don't wish to open today.)

Binary representation: This seems relatively easy. I propose we send the number of fields (int4) followed by, for each field: type OID (sizeof(Oid)), data length (int4), data according to the binary representation of the field datatype. The field count and type OIDs are not strictly necessary but seem like a good idea for error-checking purposes.

Infrastructure changes: record_out/record_send can extract the needed type info right from the Datum, but record_in/record_recv really need to be told what data type to expect, and the current call conventions for input converters don't pass them any useful information. I propose that we adjust the present definitions so that the second argument passed to I/O conversion routines, rather than being always pg_type.typelem, is defined as "if pg_type.typtype is 'c' then pg_type.oid else pg_type.typelem". That is, for composite types we'll pass the type's own OID in place of typelem. This does not affect I/O routines for user-defined types, since there are no user-defined I/O routines for composite types. It could break any user-written code that calls I/O routines, if it's been hard-wired to pass typelem instead of using one of the support routines like getTypeInputInfo() or get_type_io_data() to collect the parameters to pass.
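For illustration, the supported calling pattern looks roughly like this. A minimal sketch only: the wrapper function is hypothetical, and I am writing the helper signatures from memory, so treat them as approximate.

    #include "postgres.h"
    #include "fmgr.h"
    #include "utils/lsyscache.h"

    /*
     * Sketch: invoke a type's input converter by looking up the function
     * and its second argument with getTypeInputInfo(), instead of reading
     * pg_type.typelem by hand.
     */
    static Datum
    call_type_input(Oid typid, char *string, int32 typmod)
    {
        Oid     typinput;
        Oid     typioparam;

        /*
         * typioparam is just typelem today; under this proposal it would
         * become the type's own OID when pg_type.typtype is 'c'.
         */
        getTypeInputInfo(typid, &typinput, &typioparam);

        return OidFunctionCall3(typinput,
                                CStringGetDatum(string),
                                ObjectIdGetDatum(typioparam),
                                Int32GetDatum(typmod));
    }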
By my count there are about a dozen places in the backend code that will need to be fixed to use one of these routines instead of having a hard-wired typelem reference.

An alternative definition that might be more useful in the long run is to define the second parameter as "if pg_type.typelem is not zero then pg_type.typelem else pg_type.oid". In other words, for everything *except* arrays we'd pass the type OID. This would allow I/O routines to be written to support multiple datatypes. However, there seems a larger chance of breaking things if we do this, and I'm also fuzzy on which OID to pass for domain types. So I'm inclined to keep it conservative for now, and change the behavior only for composite types.

Comments, objections, better ideas?

			regards, tom lane
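To make the binary layout proposed above concrete, here is a rough sketch of what a record_send-style routine would emit. Illustrative only: field_send_binary() is a hypothetical placeholder for running a field datatype's binary output converter, and the proposal does not spell out how NULL fields are flagged (a -1 data length is one plausible convention, assumed here).

    #include "postgres.h"
    #include "access/tupdesc.h"
    #include "libpq/pqformat.h"

    /*
     * Sketch of the proposed binary representation: field count (int4),
     * then for each field its type OID, data length (int4), and the
     * field's data in that datatype's binary format.
     */
    static void
    record_send_sketch(StringInfo buf, TupleDesc tupdesc,
                       Datum *values, char *nulls)
    {
        int     i;

        pq_sendint(buf, tupdesc->natts, 4);         /* number of fields */
        for (i = 0; i < tupdesc->natts; i++)
        {
            Oid     typid = tupdesc->attrs[i]->atttypid;

            pq_sendint(buf, typid, sizeof(Oid));    /* field type OID */
            if (nulls[i] == 'n')
                pq_sendint(buf, -1, 4);             /* assumed NULL marker */
            else
            {
                /* hypothetical helper: run the field's binary output */
                bytea  *outval = field_send_binary(typid, values[i]);

                pq_sendint(buf, VARSIZE(outval) - VARHDRSZ, 4);
                pq_sendbytes(buf, VARDATA(outval),
                             VARSIZE(outval) - VARHDRSZ);
            }
        }
    }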
Tom Lane wrote:
> I am inclined to define this similarly to the representation for arrays;
> however, we need to allow for NULLs. I suggest
>
>     {item,item,item}
>
> The separator is always comma (it can't be type-specific since the items
> might have different types). Backslashes and double quotes can be used
> in the usual ways to quote characters in the item strings. If an item
> string is completely empty it is taken as NULL; to write an actual
> empty-string value, you must write "". There is an ambiguity whether
> '{}' represents a zero-column row or a one-column row containing a NULL,
> but I don't think this is a problem since the input converter will
> always know how many columns it is expecting.
>
> There are a couple of fine points of the array I/O behavior that I think
> we should not emulate. One is that leading whitespace in an item string
> is discarded. This seems inconsistent, mainly because trailing
> whitespace isn't discarded. In the cases where it really makes sense to
> discard whitespace (namely numeric datatypes), the underlying datatype's
> input converter can do that just fine, and so I suggest that the record
> converter itself should not discard whitespace. It seems OK to ignore
> whitespace before and after the outer braces, however.
>
> The other fine point has to do with double quoting. In the array code,
>     {a"b""c"d}
> is legal input representing an item 'abcd'. I think it would be more
> consistent with usual SQL conventions to treat it as meaning 'ab"cd',
> that is, a doubled double quote within double quotes should represent a
> double quote, not nothing. Anyone have a strong feeling one way or the
> other?

Why not use standard C semantics for the textual representation, with your addition that empty items are NULL? It becomes fairly straightforward, IMO highly readable, and the rules to define both arrays and complex types are well known and documented.

Here's an array of two composite elements of the same type. The last two items of the second element are NULL. The type is {int, double, string, char}:

    {
      {12, 123.4, "some string with \"a quoted string\" inside of it", 'c'},
      {13, -3.2,,}
    }

This will also allow you to distinguish strings from identifiers. That might prove extremely important if you ever plan to serialize self-referencing structures (a structure could then represent itself as ref_<oid> or something and thereby refer to itself).

Kind regards,
Thomas Hallgren
Thomas Hallgren <thhal@mailblocks.com> writes:
> Why not use standard C semantics for the textual representation with
> your addition that empty items are NULL?

This isn't C, it's SQL; and I think the array I/O representation is a closer precedent for us than the C standard.

In any case, how much of C syntax are you proposing to emulate exactly? Comments? Backslashed newlines? Joining of adjacent double-quoted strings? Conversion of octal and hex integer constants (and what about L, U, LL, etc suffixes)? There's a lot more stuff there than meets the eye, and most of it isn't something I want to code.

			regards, tom lane
Composite types will work recursively, right? That is, a composite type inside of a composite type column? Does the SQL dot syntax support this nested referencing? Or are we only allowing one level?

Why not just use the syntax of the insert values with parens?

    insert into table values (...);

is very familiar, so the corresponding:

    insert into table values ( 'xxx', ('yyy', 123), 456 );

is also easy to understand and remember: a row is being inserted. Is there a specific reason why you want curly brackets? I have not been following this much, to my chagrin.

On Sat, Jun 05, 2004 at 12:57:27PM -0400, Tom Lane wrote:
> good stuff deleted...
>
> There are a couple of fine points of the array I/O behavior that I think
> we should not emulate. One is that leading whitespace in an item string
> is discarded. This seems inconsistent, mainly because trailing
> whitespace isn't discarded. In the cases where it really makes sense to
> discard whitespace (namely numeric datatypes), the underlying datatype's
> input converter can do that just fine, and so I suggest that the record
> converter itself should not discard whitespace. It seems OK to ignore
> whitespace before and after the outer braces, however.

If the whitespace is inside of the item, do not discard it; let the underlying type deal with it. If the whitespace is outside of the item, ignore it. I think you probably meant this, but just to be sure:

    { " item number one " } ==> input_text( " item number one " )

> more good stuff deleted
>
> Comments, objections, better ideas?

--elein
elein@varlena.com        Varlena, LLC        www.varlena.com
PostgreSQL Consulting, Support & Training
PostgreSQL General Bits  http://www.varlena.com/GeneralBits/
"I have always depended on the [QA] of strangers."
Tom Lane wrote:
> Thomas Hallgren <thhal@mailblocks.com> writes:
>> Why not use standard C semantics for the textual representation with
>> your addition that empty items are NULL?
>
> This isn't C, it's SQL; and I think the array I/O representation is a
> closer precedent for us than the C standard.
>
> In any case, how much of C syntax are you proposing to emulate exactly?
> Comments? Backslashed newlines? Joining of adjacent double-quoted
> strings? Conversion of octal and hex integer constants (and what about
> L, U, LL, etc suffixes)? There's a lot more stuff there than meets the
> eye, and most of it isn't something I want to code.

I'm not proposing a full C parser implementation :-) Just the static data initializer part. To answer how much of the C syntax:

Comments, no. SQL has a standard for comments that doesn't conflict with C semantics for data initializers.

Joining of adjacent double-quoted strings, yes, of course. That's what you already do for arrays today. Without this, it will be hard to write long strings in a readable way.

Conversion of backslashed newlines, and octal and hex constants within strings, yes, why not? The issue of non-printables needs to be addressed somehow. What do you propose?

Regarding the L, U, LL suffixes, that depends on how you plan to tackle different character sets. Perhaps UTF-8 with Unicode escapes would be better. Some mechanism is needed, that's for sure.

Kind regards,
Thomas Hallgren
elein <elein@varlena.com> writes:
> Composite types will work recursively, right?
> That is, a composite type inside of a composite type column?

You can have that, but I'm not intending that the I/O syntax be explicitly aware of it. A composite type field would just be an item (and hence would have to be suitably quoted). So it would look something like

    {somecolumn,"{anothercolumn,\"a quoted column\"}",column3}

if we go with the syntax I originally suggested. Note that just as we offer ARRAY[] to avoid having to write this sort of thing in SQL statements, we offer ROW() so you can synthesize composite values without actually having to write this junk. I see this mainly as a dump/reload representation, so I'm not too worried about whether complex cases can be written simply.

> Does the SQL dot syntax support this nested referencing?
> Or are we only allowing one level?

You have to parenthesize to avoid ambiguity against the normal "table.field" notation, but beyond that it works. For instance (this is a real example with CVS tip + error check removed):

    regression=# create type complex as (r float8, i float8);
    CREATE TYPE
    regression=# create table foo (c complex);
    CREATE TABLE
    regression=# insert into foo values(row(1.1, 2.2));
    INSERT 154998 1

    -- this doesn't work yet:
    regression=# select c from foo f;
    ERROR:  output of composite types not implemented yet

    -- here is the ambiguity problem:
    regression=# select c.r from foo f;
    NOTICE:  adding missing FROM-clause entry for table "c"
    ERROR:  column c.r does not exist

    -- which you can fix like this:
    regression=# select (c).r, (f.c).i from foo f;
      r  |  i
    -----+-----
     1.1 | 2.2
    (1 row)

    -- nested types work about like you'd expect:
    regression=# create type quad as (c1 complex, c2 complex);
    CREATE TYPE
    regression=# create table bar (q quad);
    CREATE TABLE
    regression=# insert into bar values (row(row(1.1, 2.2), row(3.3, 4.4)));
    INSERT 155006 1
    regression=# select (q).c2.r from bar;
      r
    -----
     3.3
    (1 row)

> Why not just use the syntax of the insert values with parens?
>     insert into table values (...);
> is very familiar so the corresponding:
>     insert into table values ( 'xxx', ('yyy', 123), 456 );
> is also easy to understand and remember: a row is being inserted.

I don't particularly care one way or the other about parens versus braces; anyone else have an opinion on that? However, I do want to follow the array syntax to the extent of using double, not single, quotes for quoting items. Otherwise you've got a mess when you do try to write one of these things as a SQL literal. For instance, instead of

    '{"1.1","2.2"}'::complex

you'd have to write

    '{\'1.1\',\'2.2\'}'::complex

which is just painful. (In this particular example of course the inner quotes could just be dropped entirely, but with textual fields they would often be necessary.)

			regards, tom lane
I wrote:
> regression=# insert into bar values (row(row(1.1, 2.2), row(3.3, 4.4)));

BTW, I forgot to mention that the keyword ROW is optional as long as you've got at least two items in the row expression, so the above can be simplified to

    regression=# insert into bar values (((1.1, 2.2), (3.3, 4.4)));
    INSERT 155011 1

Some other examples:

    regression=# select (1,2)::complex;
    ERROR:  output of composite types not implemented yet
    regression=# select cast ((1,2) as complex);
    ERROR:  output of composite types not implemented yet

Looking at these, it does seem like it would be natural to get back

     complex
    ---------
     (1,2)

so I'll now agree with you that the I/O syntax should use parens, not braces, as the outer delimiters.

			regards, tom lane
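Putting the conventions settled above together (parens as outer delimiters, comma separator, a completely empty item for NULL, double-quoted items with backslash escaping), the output side might assemble a value roughly like this. A sketch only: the function and its calling convention are illustrative, and exactly which characters force quoting is my assumption.

    #include "postgres.h"
    #include <string.h>
    #include "lib/stringinfo.h"

    /*
     * Sketch: emit one composite value as text.  A NULL item is written
     * as nothing at all; items that are empty strings or contain special
     * characters are double-quoted, with embedded quotes and backslashes
     * escaped by a backslash.
     */
    static void
    append_composite_text(StringInfo buf, char **items, int nitems)
    {
        int     i;

        appendStringInfoChar(buf, '(');
        for (i = 0; i < nitems; i++)
        {
            if (i > 0)
                appendStringInfoChar(buf, ',');
            if (items[i] == NULL)
                continue;               /* NULL: completely empty item */
            if (items[i][0] == '\0' ||
                strpbrk(items[i], ",()\"\\") != NULL)
            {
                const char *p;

                appendStringInfoChar(buf, '"');
                for (p = items[i]; *p; p++)
                {
                    if (*p == '"' || *p == '\\')
                        appendStringInfoChar(buf, '\\');
                    appendStringInfoChar(buf, *p);
                }
                appendStringInfoChar(buf, '"');
            }
            else
                appendStringInfoString(buf, items[i]);
        }
        appendStringInfoChar(buf, ')');
    }

With the items 1.1, NULL, and the string a"b, for instance, this would produce (1.1,,"a\"b").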
Good reason. Now I'm excited. I'll download and run tests and try to do a write up in general bits next week.

cheers, elein

On Sat, Jun 05, 2004 at 05:00:24PM -0400, Tom Lane wrote:
> I wrote:
> > regression=# insert into bar values (row(row(1.1, 2.2), row(3.3, 4.4)));
>
> BTW, I forgot to mention that the keyword ROW is optional as long as
> you've got at least two items in the row expression, so the above can
> be simplified to
>
>     regression=# insert into bar values (((1.1, 2.2), (3.3, 4.4)));
>     INSERT 155011 1
>
> Some other examples:
>
>     regression=# select (1,2)::complex;
>     ERROR:  output of composite types not implemented yet
>     regression=# select cast ((1,2) as complex);
>     ERROR:  output of composite types not implemented yet
>
> Looking at these, it does seem like it would be natural to get back
>
>      complex
>     ---------
>      (1,2)
>
> so I'll now agree with you that the I/O syntax should use parens not
> braces as the outer delimiters.
Tom Lane wrote:
>> Why not just use the syntax of the insert values with parens?
>>     insert into table values (...);
>> is very familiar so the corresponding:
>>     insert into table values ( 'xxx', ('yyy', 123), 456 );
>> is also easy to understand and remember: a row is being inserted.
>
> I don't particularly care one way or the other about parens versus
> braces; anyone else have an opinion on that?

My vote would be on parens. It's more coherent. Do you use braces anywhere else?

Kind regards,
Thomas Hallgren
> > regression=# insert into bar values (row(row(1.1, 2.2), row(3.3, 4.4)));
>
> BTW, I forgot to mention that the keyword ROW is optional as long as
> you've got at least two items in the row expression, so the above can
> be simplified to
>
>     regression=# insert into bar values (((1.1, 2.2), (3.3, 4.4)));
>     INSERT 155011 1
>
> Some other examples:
>
>     regression=# select (1,2)::complex;
>     ERROR:  output of composite types not implemented yet
>     regression=# select cast ((1,2) as complex);
>     ERROR:  output of composite types not implemented yet
>
> Looking at these, it does seem like it would be natural to get back
>
>      complex
>     ---------
>      (1,2)
>
> so I'll now agree with you that the I/O syntax should use parens not
> braces as the outer delimiters.

Following this path, perhaps the array i/o syntax should be changed to use []s, and the keyword ARRAY should likewise be optional in the array constructor. That would let people do things like

    insert into bar values ([(1,2),(2,3)])

to insert a list of point/complex data structures, and get back '[(1,2),(2,3)]' in their dumps.

Personally I would have been more inclined to use braces for structs in both places, and either parens or brackets for arrays. But eh. This whole thing is just too cool to worry about the choice of delimiters.

-- 
greg
Greg Stark <gsstark@mit.edu> writes:
> Following this path, perhaps the array i/o syntax should be changed to
> use []s

I would think about that if there weren't compatibility issues to worry about, but in practice the pain from such an incompatible change would vastly outweigh the benefit.

> and the keyword ARRAY should likewise be optional in the array constructor.

Not sure this is syntactically feasible, or a good idea even if it is possible to get bison to take it --- it might foreclose more useful syntactic ideas later on. (I wouldn't think that omitting ROW is a good idea either, but the spec says we have to.)

			regards, tom lane