Re: On partitioning - Mailing list pgsql-hackers
From | Stephen Frost |
---|---|
Subject | Re: On partitioning |
Date | |
Msg-id | 20141114021201.GN28859@tamriel.snowman.net Whole thread Raw |
In response to | Re: On partitioning (Robert Haas <robertmhaas@gmail.com>) |
Responses |
Re: On partitioning
|
List | pgsql-hackers |
Robert, * Robert Haas (robertmhaas@gmail.com) wrote: > On Thu, Nov 13, 2014 at 1:39 AM, Stephen Frost <sfrost@snowman.net> wrote: > > but > > with declarative partitioning, I expect us to eventually be able to > > eliminate complete partitions from consideration on both sides of a > > partition-table join and optimize cases where we have two partitioned > > tables being joined with a compatible join key and only actually do > > joins between the partitions which overlap each other. I don't see > > those happening if we're allowing a node tree (only). If having a node > > tree is just one option among other partitioning options, then we can > > provide users with the ability to choose what suits their particular > > needs. > > This seems completely muddled to me. What we're talking about is how > to represent the partition definition in the system catalogs. I'm not > proposing that the user would "partition by pg_node_tree"; what the > heck would that even mean? They'd provide an expression which would be able to identify the partition to be used. In a way, this is exactly how many folks do partitioning today with inheritence- consider the if/else trees in triggers for handling new data coming into the parent table. That's also why it wouldn't be easy to optimize for. > I'm proposing one way of serializing the > partition definitions that the user specifies into something that can > be stored into a system catalog, which happens to reuse the existing > infrastructure that we use for that same purpose in various other > places. Ok, I didn't immediately see how a node tree would be used for this- but I admit that I've not gone back through the entirety of this iteration of the partitioning discussion. > I don't have a problem with somebody coming up with another > way of representing the data in the catalogs; I'm just brainstorming. Ditto. > But saying that we'll be able to optimize joins better if we store the > same data as anyarray rather than pg_node_tree or visca versa doesn't > make any sense at all. Ok, if the node tree is constrained in what can be stored in it then I understand how we could still use optimize based on what we've stored in it. I'm not entirely sure a node tree makes sense but at least I understand better. > > I'm not a fan of using pg_class- there are a number of columns in there > > which I would *not* wish to be allowed to be different per partition > > (starting with relowner and relacl...). Making those NULL would be just > > as bad (probably worse, really, since we'd also need to add new columns > > to pg_class to indicate the partitioning...) as having a sparsely > > populated new catalog table. > > I think you are, again, confused as to what we're discussing. Nobody, > including Alvaro, has proposed a design where the individual > partitions don't have pg_class entries of some kind. What we're > talking about is where to store the metadata for partition exclusion > and tuple routing. This discussion has gone a few rounds before and, yes, I was just jumping into the middle of this particular round, but I'm pretty sure I'm not the first to point out that storing the individual partition information into pg_class isn't ideal since there are pieces that we don't actually want to be different per partition, as I outlined previously. Perhaps what that means is we should actually go the other way and move *those* columns into a new catalog instead. Consider this (totally off-the-cuff): pg_relation (pg_tables? pg_heaps?) oid relname relnamespace reltype reloftype relowner relam (?) relhas* relisshared relpersistencerelkind (?) relnatts relchecks relacl reloptions relhowpartitioned (?) pg_class pg_relation.oid relfilenode reltablespace relpages reltuples reltoastrelid reltoastidxid relfrozenxid relhasindexes(?) relpartitioninfo (whatever this ends up being) The general idea being to seperate the user-facing notion of a "table" from the underlying implementation, with the implementation allowing multiple sets of files to be used for each table. It's certainly not for the faint of heart, but we saw what happened with our inheiritance structure allowing different permissions on the child tables- we ended up creating a pretty grotty hack to deal with it (going through the parent bypasses the permissions). That's the best solution for that situation, but it's far from ideal and it'd be nice to avoid that same risk with partitioning. Additionally, if each partition is in pg_class, how are we handling name conflicts? Why do individual partitions even need to have a name? Do we allow queries against them directly? etc.. These are just my thoughts on it and I really don't intend to derail progress on having a partitioning system and I hope that my comments don't lead to that happening. Thanks, Stephen
pgsql-hackers by date: