Thread: sorting table columns
Hi, I've been trying to implement the holy grail of decoupling logical/physical column sort order representation, i.e., the feature that lets the server have one physical order, for storage compactness, and a different "output" order that can be tweaked by the user. This has been discussed many times; most recently, I believe, here: http://archives.postgresql.org/pgsql-hackers/2007-02/msg01235.php with implementation details here: http://archives.postgresql.org/pgsql-hackers/2006-12/msg00983.php The idea described there by Tom, and upon which I formed a vague implementation plan in my head, is that I was to look for all uses of an "attnum", and then replace it by either "attlognum" (i.e. the user-visible sort identifier) or "attphysnum" (i.e. the order of attributes as stored on disk). This turned out to be far from the truth; the way things really work is that tupledescs are constructed from catalogs, which are then converted into target lists, and those are turned back into tupledescs in some places or into tupleslots in others. So the real implementation is about making sure that we read the column order ids, and preserve them appropriately while the query travels through parser, rewriter, planner, executor, and to client. To this end, I added members to nodes Var and TargetEntry; this lets me carry the catalog data down. This isn't particularly complex, though it was quite a challenge figuring out exactly the changes that made sense. Soon thereafter I noticed that the column sort order needs to be preserved in RangeTblEntry too, so I added a list of ints there to map from logical to canonical. So far so good. I made a few simple cases work: "select * from foo" works correctly, of course, as do joins using ON, USING and NATURAL. See attached patch (very much a WIP). (Note that for business reasons there is no SQL syntax to fool around with logical column numbers; what I do to test this is create a table and then UPDATE the pg_attribute.attlognum entries to create a different order. Also I haven't gotten into the business of handling a different physical order.) My next test case was a SQL function. There, things crashed and burned immediately and it took me some time to realize that the reason for this is the DestReceiver stuff: the patch I wrote to handle the basic cases simply sorts attrs in logical order to pass to the receiveSlot function (printtup in those basic cases), but this obviously affects how the tuple is passed to other DestReceivers too. So the function DR is getting the attributes in logical order, and then trying to stuff them into a tuplestore as a minimaltuple. But the underlying code tries to compute datum lengths using the TupleDesc and it doesn't use logical order, but just canonical (catalog) order, which doesn't match the data values. So it crashes. So at this point I'm at a crossroads. One idea was to avoid sending tuples in logical order unless the DR explicitely requests for it. So the printtup DR would set a flag so that ExecutePlan would send tuples in logical order; other DRs would not set this flag, and executor would behave normally. What's not clear to me is that this is feasible at all, because the order in which attrs are sent out are defined pretty early in parser stages, so maybe we don't know enough about the DR yet. Another idea was to modify the rest of the DRs so that they are aware that the tuples they are being passed are in logical order. Maybe this is all wrong and I need to take a completely different approach. In particular, if I'm completely on the wrong track about this, I want to know as soon as possible! Ideas? Opinions? -- Álvaro Herrera <alvherre@alvh.no-ip.org>
Attachment
Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > I've been trying to implement the holy grail of decoupling > logical/physical column sort order representation, i.e., the feature > that lets the server have one physical order, for storage compactness, > and a different "output" order that can be tweaked by the user. This > has been discussed many times; most recently, I believe, here: > http://archives.postgresql.org/pgsql-hackers/2007-02/msg01235.php > with implementation details here: > http://archives.postgresql.org/pgsql-hackers/2006-12/msg00983.php > The idea described there by Tom, and upon which I formed a vague > implementation plan in my head, is that I was to look for all uses of > an "attnum", and then replace it by either "attlognum" (i.e. the > user-visible sort identifier) or "attphysnum" (i.e. the order of > attributes as stored on disk). I thought we'd concluded that we really need three values: attnum should be a permanent logical ID for each column, and then the user-visible column order would be determined by a different number, and the on-disk column order by a third. If we're going to do this at all, it seems like a seriously bad idea to only go halfway, because then we'll just have to revisit all the same code again later. You do *not* want to store either of the latter two numbers in parse-time Var nodes, because then you can't rearrange columns without having to update stored rules. But it might be useful to decree that one thing setrefs.c does is renumber Vars in scan nodes to use the physical column numbers instead of the permanent IDs. I haven't looked into any of the details, but I would guess that targetlists should always be constructed in logical (user-visible) column order. TupleDescs need to match the physical order, most likely. Note that all three orderings are always going to be the same everywhere above the table scan level. (And I suppose COPY will need some hack or other.) regards, tom lane
Excerpts from Tom Lane's message of mar dic 20 18:24:29 -0300 2011: > Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > > I've been trying to implement the holy grail of decoupling > > logical/physical column sort order representation, i.e., the feature > > that lets the server have one physical order, for storage compactness, > > and a different "output" order that can be tweaked by the user. This > > has been discussed many times; most recently, I believe, here: > > http://archives.postgresql.org/pgsql-hackers/2007-02/msg01235.php > > with implementation details here: > > http://archives.postgresql.org/pgsql-hackers/2006-12/msg00983.php > > > The idea described there by Tom, and upon which I formed a vague > > implementation plan in my head, is that I was to look for all uses of > > an "attnum", and then replace it by either "attlognum" (i.e. the > > user-visible sort identifier) or "attphysnum" (i.e. the order of > > attributes as stored on disk). > > I thought we'd concluded that we really need three values: attnum should > be a permanent logical ID for each column, and then the user-visible > column order would be determined by a different number, and the on-disk > column order by a third. If we're going to do this at all, it seems > like a seriously bad idea to only go halfway, because then we'll just > have to revisit all the same code again later. Yeah, I was unclear -- that's what I'm doing (or, rather, attempting to do). > You do *not* want to store either of the latter two numbers in > parse-time Var nodes, because then you can't rearrange columns without > having to update stored rules. But it might be useful to decree that > one thing setrefs.c does is renumber Vars in scan nodes to use the > physical column numbers instead of the permanent IDs. Hmm, having the numbers in Var nodes seems a fundamental part of the way I'm attacking the problem. Hopefully after I give setrefs.c a read I will have a clearer picture of the way to do it without that. > I haven't looked into any of the details, but I would guess that > targetlists should always be constructed in logical (user-visible) > column order. TupleDescs need to match the physical order, most > likely. Note that all three orderings are always going to be the same > everywhere above the table scan level. (And I suppose COPY will need > some hack or other.) Okay. AFAICS this shoots down the idea of modifying destreceivers, which is good because I was coming to that conclusion for a different reason. Thanks for the pointers. -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes: > Excerpts from Tom Lane's message of mar dic 20 18:24:29 -0300 2011: >> You do *not* want to store either of the latter two numbers in >> parse-time Var nodes, because then you can't rearrange columns without >> having to update stored rules. But it might be useful to decree that >> one thing setrefs.c does is renumber Vars in scan nodes to use the >> physical column numbers instead of the permanent IDs. > Hmm, having the numbers in Var nodes seems a fundamental part of the way > I'm attacking the problem. Hopefully after I give setrefs.c a read I > will have a clearer picture of the way to do it without that. To clarify a bit: one thing that setrefs.c already does is to renumber Var nodes above the scan level, so that their attnums refer not to original table column attnums but to column numbers in the output of the next plan level down. Vars in scan nodes currently don't need any renumbering, but it'd be easy enough to extend the logic to do something to them as well. I'm visualizing the run-time transformation from physical to logical column ordering as a sort of projection, much like the mapping that happens in a join node. regards, tom lane
On Tue, Dec 20, 2011 at 9:47 PM, Alvaro Herrera <alvherre@commandprompt.com> wrote: >> > The idea described there by Tom, and upon which I formed a vague >> > implementation plan in my head, is that I was to look for all uses of >> > an "attnum", and then replace it by either "attlognum" (i.e. the >> > user-visible sort identifier) or "attphysnum" (i.e. the order of >> > attributes as stored on disk). >> >> I thought we'd concluded that we really need three values: attnum should >> be a permanent logical ID for each column, and then the user-visible >> column order would be determined by a different number, and the on-disk >> column order by a third. If we're going to do this at all, it seems >> like a seriously bad idea to only go halfway, because then we'll just >> have to revisit all the same code again later. > > Yeah, I was unclear -- that's what I'm doing (or, rather, attempting to > do). Sounds great. While you're doing this, I'd like to think about future requirements, to see if that changes anything. Having a unique logical column id is a great thing because it allows the physical storage to differ. This is the first part to allowing these features... * "column-based storage" where the data for some column(s) lives in a dedicated heap * "vertical partitioning" where defined groups of columns live in separate heaps for performance and/or security * "generated columns" where the column exists only logically and is derived at run-time (per SQL Standard) * "key/value columns" where we retrieve the column value from an hstore * "very large number of columns" for statistical data sets where we automatically vertically partition the heap when faced with large numbers of column definitions So when you store the physical representation please also store a storage method, that currently has just one method SM_HEAP and a relfilenode. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Excerpts from Simon Riggs's message of mié dic 21 09:44:04 -0300 2011: > Sounds great. > > While you're doing this, I'd like to think about future requirements, > to see if that changes anything. > > Having a unique logical column id is a great thing because it allows > the physical storage to differ. This is the first part to allowing > these features... Great ideas. This one I'm not sure about at all: > * "very large number of columns" for statistical data sets where we > automatically vertically partition the heap when faced with large > numbers of column definitions > > So when you store the physical representation please also store a > storage method, that currently has just one method SM_HEAP and a > relfilenode. Well, for the patch I'm working on right now, I'm just going to store an ID as "physical representation", which will mean the sort order used for the on-disk representation of our current heap storage; the idea here is to allow columns to be sorted internally by the system so that alignment padding is reduced; nothing more. Of course, we can work on more complex representations later that allow different storage strategies, such as the ones you propose. -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Wed, Dec 21, 2011 at 1:42 PM, Alvaro Herrera <alvherre@commandprompt.com> wrote: > This one I'm not sure about at all: > >> * "very large number of columns" for statistical data sets where we >> automatically vertically partition the heap when faced with large >> numbers of column definitions We currently have pg_attribute.attnum as an int2, so we can store up to 32768 columns without changing that size, as long as we have some place to put the data. Was there something you're working on likely to preventing >240 cols? Just worth documenting what you see at this stage. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Excerpts from Simon Riggs's message of mié dic 21 15:53:20 -0300 2011: > On Wed, Dec 21, 2011 at 1:42 PM, Alvaro Herrera > <alvherre@commandprompt.com> wrote: > > > This one I'm not sure about at all: > > > >> * "very large number of columns" for statistical data sets where we > >> automatically vertically partition the heap when faced with large > >> numbers of column definitions > > We currently have pg_attribute.attnum as an int2, so we can store up > to 32768 columns without changing that size, as long as we have some > place to put the data. Hm, right. > Was there something you're working on likely to preventing >240 cols? No, not at all. > Just worth documenting what you see at this stage. I'll keep my eyes open :-) -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Excerpts from Tom Lane's message of mar dic 20 22:23:36 -0300 2011: > > Alvaro Herrera <alvherre@commandprompt.com> writes: > > Excerpts from Tom Lane's message of mar dic 20 18:24:29 -0300 2011: > >> You do *not* want to store either of the latter two numbers in > >> parse-time Var nodes, because then you can't rearrange columns without > >> having to update stored rules. But it might be useful to decree that > >> one thing setrefs.c does is renumber Vars in scan nodes to use the > >> physical column numbers instead of the permanent IDs. > > > Hmm, having the numbers in Var nodes seems a fundamental part of the way > > I'm attacking the problem. Hopefully after I give setrefs.c a read I > > will have a clearer picture of the way to do it without that. > > To clarify a bit: one thing that setrefs.c already does is to renumber > Var nodes above the scan level, so that their attnums refer not to > original table column attnums but to column numbers in the output of the > next plan level down. Vars in scan nodes currently don't need any > renumbering, but it'd be easy enough to extend the logic to do something > to them as well. I'm visualizing the run-time transformation from > physical to logical column ordering as a sort of projection, much like > the mapping that happens in a join node. After more playing with this, it turned out that those logical numbers stored in Var and TargetEntry are actually completely useless; after they served their purpose in helping me track down that I actually needed to sort columns at the RangeTblEntry level, I was able to revert all those bits and things work fine (actually they work better). So far, I have had no need to touch setrefs.c that I see. The reason is that * expansion happens much earlier than setrefs.c is involved, at the parse analysis level; the target lists generated at that point must already follow the logical column order. So that part of the patch becomes this: diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h index 9e277c5..f640bd8 100644 --- a/src/include/nodes/parsenodes.h +++ b/src/include/nodes/parsenodes.h @@ -701,6 +701,7 @@ typedef struct RangeTblEntry */ Oid relid; /* OID of the relation */ char relkind; /* relation kind (see pg_class.relkind) */ + List *lognums; /* int list of logical column numbers */ /* * Fields valid for a subquery RTE (elseNULL): Note that the the eref->colnames list is built in logical column order (which is what it should be, because it then matches the alias->colnames list). With all that, it's easy to map the attnums to the logical numbers when the target list is being constructed. And things work fine from that point onwards, because we still keep track of the original attnum to reference the TupleDesc. A RTE in a stored rule looks like this: {RTE :alias <> :eref {ALIAS :aliasname bar :colnames ("z" "y" "x")} :rtekind 0 :relid 16404 :relkind r :lognums (i 3 2 1) :inh true :inFromCl true :requiredPerms 2 :checkAsUser 0 :selectedCols (b 9 10 11) :modifiedCols (b)} The original table was created with columns "x, y, z", and then I reversed the order. So if I change the column order in the original table, the rule does not need any change and it continues to return the logical order that the table had when the view was originally defined. (I wonder what the other two RTEs, those named "new" and "old", are for.) One thing I'm finding necessary (for COPY support as well as things that travel through different DestReceivers, such as SQL functions) is that TupleTableSlots need to keep track of logical vs. physical order, and form/deform tuples using the correct ordering. So the values/isnull arrays may be in either order depending on what the caller is doing. At some point a MinimalTuple might be constructed in logical order, for example, and the caller must be aware of this so that it can be deconstructed correctly later on. I mention this so that there's time for bells to ring ... -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support