Thread: I/O support for composite types
There's just one thing left to do to make composite types useful as table columns: we have to support I/O of composite values. (Without this, pg_dump would fail to work on such columns, rendering them not very useful in the real world.) This means we have to hammer out a definition for what the external representation is. Here are my thoughts on the subject.

Textual representation: I am inclined to define this similarly to the representation for arrays; however, we need to allow for NULLs. I suggest

    {item,item,item}

The separator is always comma (it can't be type-specific since the items might have different types). Backslashes and double quotes can be used in the usual ways to quote characters in the item strings. If an item string is completely empty it is taken as NULL; to write an actual empty-string value, you must write "". There is an ambiguity whether '{}' represents a zero-column row or a one-column row containing a NULL, but I don't think this is a problem since the input converter will always know how many columns it is expecting.

There are a couple of fine points of the array I/O behavior that I think we should not emulate. One is that leading whitespace in an item string is discarded. This seems inconsistent, mainly because trailing whitespace isn't discarded. In the cases where it really makes sense to discard whitespace (namely numeric datatypes), the underlying datatype's input converter can do that just fine, and so I suggest that the record converter itself should not discard whitespace. It seems OK to ignore whitespace before and after the outer braces, however.

The other fine point has to do with double quoting. In the array code,

    {a"b""c"d}

is legal input representing an item 'abcd'. I think it would be more consistent with usual SQL conventions to treat it as meaning 'ab"cd', that is, a doubled double quote within double quotes should represent a double quote, not nothing. Anyone have a strong feeling one way or the other? (In the long run we might want to think about making these same changes in array_in, but that's a can of worms I don't wish to open today.)

Binary representation: This seems relatively easy. I propose we send the number of fields (int4) followed by, for each field: type OID (sizeof(Oid)), data length (int4), data according to the binary representation of the field datatype. The field count and type OIDs are not strictly necessary but seem like a good idea for error-checking purposes.

Infrastructure changes: record_out/record_send can extract the needed type info right from the Datum, but record_in/record_recv really need to be told what data type to expect, and the current call conventions for input converters don't pass them any useful information. I propose that we adjust the present definitions so that the second argument passed to I/O conversion routines, rather than being always pg_type.typelem, is defined as "if pg_type.typtype is 'c' then pg_type.oid else pg_type.typelem". That is, for composite types we'll pass the type's own OID in place of typelem. This does not affect I/O routines for user-defined types, since there are no user-defined I/O routines for composite types. It could break any user-written code that calls I/O routines, if it's been hard-wired to pass typelem instead of using one of the support routines like getTypeInputInfo() or get_type_io_data() to collect the parameters to pass.
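For illustration, the supported calling pattern looks roughly like this. A minimal sketch only: the wrapper function is hypothetical, and I am writing the helper signatures from memory, so treat them as approximate.

    #include "postgres.h"
    #include "fmgr.h"
    #include "utils/lsyscache.h"

    /*
     * Sketch: invoke a type's input converter by looking up the function
     * and its second argument with getTypeInputInfo(), instead of reading
     * pg_type.typelem by hand.
     */
    static Datum
    call_type_input(Oid typid, char *string, int32 typmod)
    {
        Oid     typinput;
        Oid     typioparam;

        /*
         * typioparam is just typelem today; under this proposal it would
         * become the type's own OID when pg_type.typtype is 'c'.
         */
        getTypeInputInfo(typid, &typinput, &typioparam);

        return OidFunctionCall3(typinput,
                                CStringGetDatum(string),
                                ObjectIdGetDatum(typioparam),
                                Int32GetDatum(typmod));
    }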
By my count there are about a dozen places in the backend code that will need to be fixed to use one of these routines instead of having a hard-wired typelem reference.

An alternative definition that might be more useful in the long run is to define the second parameter as "if pg_type.typelem is not zero then pg_type.typelem else pg_type.oid". In other words, for everything *except* arrays we'd pass the type OID. This would allow I/O routines to be written to support multiple datatypes. However, there seems a larger chance of breaking things if we do this, and I'm also fuzzy on which OID to pass for domain types. So I'm inclined to keep it conservative for now, and change the behavior only for composite types.

Comments, objections, better ideas?

			regards, tom lane
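To make the binary layout proposed above concrete, here is a rough sketch of what a record_send-style routine would emit. Illustrative only: field_send_binary() is a hypothetical placeholder for running a field datatype's binary output converter, and the proposal does not spell out how NULL fields are flagged (a -1 data length is one plausible convention, assumed here).

    #include "postgres.h"
    #include "access/tupdesc.h"
    #include "libpq/pqformat.h"

    /*
     * Sketch of the proposed binary representation: field count (int4),
     * then for each field its type OID, data length (int4), and the
     * field's data in that datatype's binary format.
     */
    static void
    record_send_sketch(StringInfo buf, TupleDesc tupdesc,
                       Datum *values, char *nulls)
    {
        int     i;

        pq_sendint(buf, tupdesc->natts, 4);         /* number of fields */
        for (i = 0; i < tupdesc->natts; i++)
        {
            Oid     typid = tupdesc->attrs[i]->atttypid;

            pq_sendint(buf, typid, sizeof(Oid));    /* field type OID */
            if (nulls[i] == 'n')
                pq_sendint(buf, -1, 4);             /* assumed NULL marker */
            else
            {
                /* hypothetical helper: run the field's binary output */
                bytea  *outval = field_send_binary(typid, values[i]);

                pq_sendint(buf, VARSIZE(outval) - VARHDRSZ, 4);
                pq_sendbytes(buf, VARDATA(outval),
                             VARSIZE(outval) - VARHDRSZ);
            }
        }
    }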
Tom Lane wrote:
> I am inclined to define this similarly to the representation for arrays;
> however, we need to allow for NULLs. I suggest
>
>     {item,item,item}
>
> The separator is always comma (it can't be type-specific since the items
> might have different types). Backslashes and double quotes can be used
> in the usual ways to quote characters in the item strings. If an item
> string is completely empty it is taken as NULL; to write an actual
> empty-string value, you must write "". There is an ambiguity whether
> '{}' represents a zero-column row or a one-column row containing a NULL,
> but I don't think this is a problem since the input converter will
> always know how many columns it is expecting.
>
> There are a couple of fine points of the array I/O behavior that I think
> we should not emulate. One is that leading whitespace in an item string
> is discarded. This seems inconsistent, mainly because trailing
> whitespace isn't discarded. In the cases where it really makes sense to
> discard whitespace (namely numeric datatypes), the underlying datatype's
> input converter can do that just fine, and so I suggest that the record
> converter itself should not discard whitespace. It seems OK to ignore
> whitespace before and after the outer braces, however.
>
> The other fine point has to do with double quoting. In the array code,
>     {a"b""c"d}
> is legal input representing an item 'abcd'. I think it would be more
> consistent with usual SQL conventions to treat it as meaning 'ab"cd',
> that is, a doubled double quote within double quotes should represent a
> double quote, not nothing. Anyone have a strong feeling one way or the
> other?

Why not use standard C semantics for the textual representation, with your addition that empty items are NULL? It becomes fairly straightforward, IMO highly readable, and the rules to define both arrays and complex types are well known and documented.

Here's an array of two composite elements of the same type. The last two items of the second element are NULL. The type is {int, double, string, char}:

    {
      {12, 123.4, "some string with \"a quoted string\" inside of it", 'c'},
      {13, -3.2,,}
    }

This will also allow you to distinguish strings from identifiers. That might prove extremely important if you ever plan to serialize self-referencing structures (a structure could then represent itself as ref_<oid> or something and thereby refer to itself).

Kind regards,
Thomas Hallgren
Thomas Hallgren <thhal@mailblocks.com> writes:
> Why not use standard C semantics for the textual representation with
> your addition that empty items are NULL?

This isn't C, it's SQL; and I think the array I/O representation is a closer precedent for us than the C standard.

In any case, how much of C syntax are you proposing to emulate exactly? Comments? Backslashed newlines? Joining of adjacent double-quoted strings? Conversion of octal and hex integer constants (and what about L, U, LL, etc suffixes)? There's a lot more stuff there than meets the eye, and most of it isn't something I want to code.

			regards, tom lane
Composite types will work recursively, right? That is, a composite type inside of a composite type column? Does the SQL dot syntax support this nested referencing? Or are we only allowing one level?

Why not just use the syntax of the insert values with parens?

    insert into table values (...);

is very familiar, so the corresponding:

    insert into table values ( 'xxx', ('yyy', 123), 456 );

is also easy to understand and remember: a row is being inserted. Is there a specific reason why you want curly brackets? I have not been following this much, to my chagrin.

On Sat, Jun 05, 2004 at 12:57:27PM -0400, Tom Lane wrote:
> good stuff deleted...
>
> There are a couple of fine points of the array I/O behavior that I think
> we should not emulate. One is that leading whitespace in an item string
> is discarded. This seems inconsistent, mainly because trailing
> whitespace isn't discarded. In the cases where it really makes sense to
> discard whitespace (namely numeric datatypes), the underlying datatype's
> input converter can do that just fine, and so I suggest that the record
> converter itself should not discard whitespace. It seems OK to ignore
> whitespace before and after the outer braces, however.

If the whitespace is inside of the item, do not discard it; let the underlying type deal with it. If the whitespace is outside of the item, ignore it. I think you probably meant this, but just to be sure:

    { " item number one " } ==> input_text( " item number one " )

> more good stuff deleted
>
> Comments, objections, better ideas?

--elein
elein@varlena.com        Varlena, LLC        www.varlena.com
PostgreSQL Consulting, Support & Training
PostgreSQL General Bits  http://www.varlena.com/GeneralBits/
"I have always depended on the [QA] of strangers."
Tom Lane wrote:
> Thomas Hallgren <thhal@mailblocks.com> writes:
>> Why not use standard C semantics for the textual representation with
>> your addition that empty items are NULL?
>
> This isn't C, it's SQL; and I think the array I/O representation is a
> closer precedent for us than the C standard.
>
> In any case, how much of C syntax are you proposing to emulate exactly?
> Comments? Backslashed newlines? Joining of adjacent double-quoted
> strings? Conversion of octal and hex integer constants (and what about
> L, U, LL, etc suffixes)? There's a lot more stuff there than meets the
> eye, and most of it isn't something I want to code.

I'm not proposing a full C parser implementation :-) Just the static data initializer part. To answer how much of the C syntax:

Comments, no. SQL has a standard for comments that doesn't conflict with C semantics for data initializers.

Joining of adjacent double-quoted strings, yes, of course. That's what you already do for arrays today. Without this, it will be hard to write long strings in a readable way.

Conversion of backslashed newlines, and octal and hex constants within strings, yes, why not? The issue of non-printables needs to be addressed somehow. What do you propose?

Regarding the L, U, LL suffixes, that depends on how you plan to tackle different character sets. Perhaps UTF-8 with Unicode escapes would be better. Some mechanism is needed, that's for sure.

Kind regards,
Thomas Hallgren
elein <elein@varlena.com> writes:
> Composite types will work recursively, right?
> That is, a composite type inside of a composite type column?

You can have that, but I'm not intending that the I/O syntax be explicitly aware of it. A composite type field would just be an item (and hence would have to be suitably quoted). So it would look something like

    {somecolumn,"{anothercolumn,\"a quoted column\"}",column3}

if we go with the syntax I originally suggested. Note that just as we offer ARRAY[] to avoid having to write this sort of thing in SQL statements, we offer ROW() so you can synthesize composite values without actually having to write this junk. I see this mainly as a dump/reload representation, so I'm not too worried about whether complex cases can be written simply.

> Does the SQL dot syntax support this nested referencing?
> Or are we only allowing one level?

You have to parenthesize to avoid ambiguity against the normal "table.field" notation, but beyond that it works. For instance (this is a real example with CVS tip + error check removed):

    regression=# create type complex as (r float8, i float8);
    CREATE TYPE
    regression=# create table foo (c complex);
    CREATE TABLE
    regression=# insert into foo values(row(1.1, 2.2));
    INSERT 154998 1

    -- this doesn't work yet:
    regression=# select c from foo f;
    ERROR:  output of composite types not implemented yet

    -- here is the ambiguity problem:
    regression=# select c.r from foo f;
    NOTICE:  adding missing FROM-clause entry for table "c"
    ERROR:  column c.r does not exist

    -- which you can fix like this:
    regression=# select (c).r, (f.c).i from foo f;
      r  |  i
    -----+-----
     1.1 | 2.2
    (1 row)

    -- nested types work about like you'd expect:
    regression=# create type quad as (c1 complex, c2 complex);
    CREATE TYPE
    regression=# create table bar (q quad);
    CREATE TABLE
    regression=# insert into bar values (row(row(1.1, 2.2), row(3.3, 4.4)));
    INSERT 155006 1
    regression=# select (q).c2.r from bar;
      r
    -----
     3.3
    (1 row)

> Why not just use the syntax of the insert values with parens?
>     insert into table values (...);
> is very familiar so the corresponding:
>     insert into table values ( 'xxx', ('yyy', 123), 456 );
> is also easy to understand and remember: a row is being inserted.

I don't particularly care one way or the other about parens versus braces; anyone else have an opinion on that? However, I do want to follow the array syntax to the extent of using double, not single, quotes for quoting items. Otherwise you've got a mess when you do try to write one of these things as a SQL literal. For instance, instead of

    '{"1.1","2.2"}'::complex

you'd have to write

    '{\'1.1\',\'2.2\'}'::complex

which is just painful. (In this particular example of course the inner quotes could just be dropped entirely, but with textual fields they would often be necessary.)

			regards, tom lane
I wrote:
> regression=# insert into bar values (row(row(1.1, 2.2), row(3.3, 4.4)));

BTW, I forgot to mention that the keyword ROW is optional as long as you've got at least two items in the row expression, so the above can be simplified to

    regression=# insert into bar values (((1.1, 2.2), (3.3, 4.4)));
    INSERT 155011 1

Some other examples:

    regression=# select (1,2)::complex;
    ERROR:  output of composite types not implemented yet
    regression=# select cast ((1,2) as complex);
    ERROR:  output of composite types not implemented yet

Looking at these, it does seem like it would be natural to get back

     complex
    ---------
     (1,2)

so I'll now agree with you that the I/O syntax should use parens, not braces, as the outer delimiters.

			regards, tom lane
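Putting the conventions settled above together (parens as outer delimiters, comma separator, a completely empty item for NULL, double-quoted items with backslash escaping), the output side might assemble a value roughly like this. A sketch only: the function and its calling convention are illustrative, and exactly which characters force quoting is my assumption.

    #include "postgres.h"
    #include <string.h>
    #include "lib/stringinfo.h"

    /*
     * Sketch: emit one composite value as text.  A NULL item is written
     * as nothing at all; items that are empty strings or contain special
     * characters are double-quoted, with embedded quotes and backslashes
     * escaped by a backslash.
     */
    static void
    append_composite_text(StringInfo buf, char **items, int nitems)
    {
        int     i;

        appendStringInfoChar(buf, '(');
        for (i = 0; i < nitems; i++)
        {
            if (i > 0)
                appendStringInfoChar(buf, ',');
            if (items[i] == NULL)
                continue;               /* NULL: completely empty item */
            if (items[i][0] == '\0' ||
                strpbrk(items[i], ",()\"\\") != NULL)
            {
                const char *p;

                appendStringInfoChar(buf, '"');
                for (p = items[i]; *p; p++)
                {
                    if (*p == '"' || *p == '\\')
                        appendStringInfoChar(buf, '\\');
                    appendStringInfoChar(buf, *p);
                }
                appendStringInfoChar(buf, '"');
            }
            else
                appendStringInfoString(buf, items[i]);
        }
        appendStringInfoChar(buf, ')');
    }

With the items 1.1, NULL, and the string a"b, for instance, this would produce (1.1,,"a\"b").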
Good reason. Now I'm excited. I'll download and run tests and try to do a write up in general bits next week.

cheers, elein

On Sat, Jun 05, 2004 at 05:00:24PM -0400, Tom Lane wrote:
> I wrote:
> > regression=# insert into bar values (row(row(1.1, 2.2), row(3.3, 4.4)));
>
> BTW, I forgot to mention that the keyword ROW is optional as long as
> you've got at least two items in the row expression, so the above can
> be simplified to
>
>     regression=# insert into bar values (((1.1, 2.2), (3.3, 4.4)));
>     INSERT 155011 1
>
> Some other examples:
>
>     regression=# select (1,2)::complex;
>     ERROR:  output of composite types not implemented yet
>     regression=# select cast ((1,2) as complex);
>     ERROR:  output of composite types not implemented yet
>
> Looking at these, it does seem like it would be natural to get back
>
>      complex
>     ---------
>      (1,2)
>
> so I'll now agree with you that the I/O syntax should use parens not
> braces as the outer delimiters.
Tom Lane wrote:
>> Why not just use the syntax of the insert values with parens?
>>     insert into table values (...);
>> is very familiar so the corresponding:
>>     insert into table values ( 'xxx', ('yyy', 123), 456 );
>> is also easy to understand and remember: a row is being inserted.
>
> I don't particularly care one way or the other about parens versus
> braces; anyone else have an opinion on that?

My vote would be on parens. It's more coherent. Do you use braces anywhere else?

Kind regards,
Thomas Hallgren
> > regression=# insert into bar values (row(row(1.1, 2.2), row(3.3, 4.4)));
>
> BTW, I forgot to mention that the keyword ROW is optional as long as
> you've got at least two items in the row expression, so the above can
> be simplified to
>
>     regression=# insert into bar values (((1.1, 2.2), (3.3, 4.4)));
>     INSERT 155011 1
>
> Some other examples:
>
>     regression=# select (1,2)::complex;
>     ERROR:  output of composite types not implemented yet
>     regression=# select cast ((1,2) as complex);
>     ERROR:  output of composite types not implemented yet
>
> Looking at these, it does seem like it would be natural to get back
>
>      complex
>     ---------
>      (1,2)
>
> so I'll now agree with you that the I/O syntax should use parens not
> braces as the outer delimiters.

Following this path, perhaps the array i/o syntax should be changed to use []s, and the keyword ARRAY should likewise be optional in the array constructor. That would let people do things like

    insert into bar values ([(1,2),(2,3)])

to insert a list of point/complex data structures, and get back '[(1,2),(2,3)]' in their dumps.

Personally I would have been more inclined to use braces for structs in both places, and either parens or brackets for arrays. But eh. This whole thing is just too cool to worry about the choice of delimiters.

-- 
greg
Greg Stark <gsstark@mit.edu> writes:
> Following this path, perhaps the array i/o syntax should be changed to
> use []s

I would think about that if there weren't compatibility issues to worry about, but in practice the pain from such an incompatible change would vastly outweigh the benefit.

> and the keyword ARRAY should likewise be optional in the array constructor.

Not sure this is syntactically feasible, or a good idea even if it is possible to get bison to take it --- it might foreclose more useful syntactic ideas later on. (I wouldn't think that omitting ROW is a good idea either, but the spec says we have to.)

			regards, tom lane