Thread: Global Sequences
Sequences, as defined by SQL Standard, provide a series of unique values. The current implementation on PostgreSQL isolates the generation mechanism to only a single node, as is common on many RDBMS.

For sharded or replicated systems it forces people to various hackish mechanisms in user space for emulating a global or cluster-wide sequence.

The solution to this problem is an in-core solution that allows coordination between nodes to guarantee unique values.

There are a few options
1) Manual separation of the value space, so that N1 has 50% of possible values and N2 has 50%. That has problems when we reconfigure the cluster, and requires complex manual reallocation of values. So it starts good but ends badly.
2) Automatic separation of the value space. This could mimic the manual operation, so it does everything for you - but that's just making a bad idea automatic.
3) Lazy allocation from the value space. When a node is close to running out of values, it requests a new allocation and coordinates with all nodes to confirm the new allocation is good.

(3) is similar to the way values are allocated currently, so the only addition is a multi-node allocation algorithm to allocate new value ranges. That seems to be the best way to go. Any implementation for that presumes how the node configuration and inter-node transport works, which we would like to keep open for use by various external tools.

So, proposal is to allow nextval() allocation to access a plugin, rather than simply write a WAL record and increment. If the plugin is loaded all sequences call it (not OIDs). We'd call this the Global Sequence API. The API looks like it would be pretty stable to me. We can put something in contrib if required to prove it works, as well as providing some optional caching to avoid noticeable performance effects.

Note that if you did just want to implement manual separation of ranges then this would also make it slightly easier, so this approach supports all flavors, which a more hardcoded solution would not.

Any comments before I demonstrate a patch to do this? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
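For illustration, here is a rough sketch of the kind of hook this proposal points at: a single process-global function pointer that, when set by a loaded plugin, is consulted by nextval() instead of the built-in WAL-logged increment. Every name and signature below is an assumption made for the sketch, not an existing PostgreSQL symbol, and plain C types stand in for the usual Oid/int64 typedefs.

    #include <stdint.h>
    #include <stddef.h>

    /* Plugin-provided allocator: given a sequence (identified here by its OID
     * as a plain integer) and its configured increment, return the next value. */
    typedef int64_t (*global_nextval_hook_type)(uint32_t seq_oid, int64_t increment);

    /* NULL means "no plugin loaded"; nextval() then behaves exactly as today. */
    global_nextval_hook_type global_nextval_hook = NULL;

    static int64_t
    nextval_sketch(uint32_t seq_oid, int64_t increment)
    {
        if (global_nextval_hook != NULL)
            return global_nextval_hook(seq_oid, increment);   /* plugin decides */

        /* ... otherwise the existing path: WAL record plus local increment ... */
        return 0;   /* placeholder for the built-in behaviour */
    }

A replication plugin would set the pointer from its _PG_init()-style entry point; with the pointer unset, single-node behaviour is unchanged, which is what makes the plugin optional.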
On 10/15/12 2:33 PM, Simon Riggs wrote: > 3) Lazy allocation from the value space. When a node is close to > running out of values, it requests a new allocation and coordinates > with all nodes to confirm the new allocation is good. > > (3) is similar to the way values are allocated currently, so the only > addition is a multi-node allocation algorithm to allocate new value > ranges. That seems to be the best way to go. Any implementation for > that presumes how the node configuration and inter-node transport > works, which we would like to keep open for use by various external > tools. So you're talking about allocating a block of values for each node? Seems straightforward. That does mean that there needs to be a "master" node in charge of the allocations, though, yes? How would this be tooled/managed? I'd also love to hear from the PostgresXC folks on whether this solution works for them. Postgres-R too. If it works for all three of those tools, it's liable to work for any potential new tool. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Mon, Oct 15, 2012 at 2:33 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > Sequences, as defined by SQL Standard, provide a series of unique > values. The current implementation on PostgreSQL isolates the > generation mechanism to only a single node, as is common on many > RDBMS. > > For sharded or replicated systems it forces people to various hackish > mechanisms in user space for emulating a global or cluster-wide > sequence. > > The solution to this problem is an in-core solution that allows > coordination between nodes to guarantee unique values. > > There are a few options > 1) Manual separation of the value space, so that N1 has 50% of > possible values and N2 has 50%. That has problems when we reconfigure > the cluster, and requires complex manual reallocation of values. So it > starts good but ends badly. > 2) Automatic separation of the value space. This could mimic the > manual operation, so it does everything for you - but thats just > making a bad idea automatic > 3) Lazy allocation from the value space. When a node is close to > running out of values, it requests a new allocation and coordinates > with all nodes to confirm the new allocation is good. While useful to some people, it seems like a way to avoid a crazy amount of complexity whereas most people are fine just using uncoordinated 128-bit integers. There are some who want temporal locality and smaller datums, but couldn't the problem be minimized somewhat by presuming non-coordinated identifier generation? It seems like a proper subset of what you propose, so perhaps that's a nice way to bisect the problem. I agree with you that an in-database way to do this -- even if in principle it could be done by clients most of the time -- would lend a lot more cohesion to the system. FWIW, I like "3" otherwise -- much like the foibles of most partitioning schemes and some of the stated design considerations of segment exclusion, I think absolute rigidity at all times makes it really hard to gradually move things into a desired alignment incrementally while the system is online. -- fdr
Josh, * Josh Berkus (josh@agliodbs.com) wrote: > I'd also love to hear from the PostgresXC folks on whether this solution > works for them. Postgres-R too. If it works for all three of those > tools, it's liable to work for any potential new tool. AIUI, PG-XC and PG-R need an order, so they both use an independent system (eg, the PG-XC GTM) to provide that ordering. Again, AIUI, Simon's proposal would not guarantee any ordering but instead would only guarantee non-overlap. Since the proposal being pushed appeared to involve all the complexity of dealing with something like a GTM, by having to have some third system which manages the allocations, figure out what to do if it isn't available, etc, perhaps the requirement to provide an ordering should be added on to it and then the PG-XC GTM simply used for it. If we're not going to have an ordering requirement then I'm not convinced that the pre-allocation approach (where you break the space up into many more blocks than you would ever expect to use and then double-up those blocks on to the same physical system) is a bad solution. It's certainly also been done a number of times, typically quite successfully. Thanks, Stephen
Stephen, > AIUI, PG-XC and PG-R need an order, so they both use an independent > system (eg, the PG-XC GTM) to provide that ordering. You're thinking of XIDs. This is a proposal for user-defined sequences. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
* Josh Berkus (josh@agliodbs.com) wrote: > > AIUI, PG-XC and PG-R need an order, so they both use an independent > > system (eg, the PG-XC GTM) to provide that ordering. > > You're thinking of XIDs. This is a proposal for user-defined sequences. Right, I got that it's a proposal for user sequences. I stand by my comments regarding that proposal- if you're going to have all the complexity of having to talk to some other system to figure out what value(s) you can use, you might as well get an ordering with it. If that's not necessary, then doing pre-allocation is probably sufficient and would provide a direct way, without having to ask any other system, to figure out which system a given value is on. Thanks, Stephen
On Tue, Oct 16, 2012 at 10:30 AM, Josh Berkus <josh@agliodbs.com> wrote: > Stephen, >> AIUI, PG-XC and PG-R need an order, so they both use an independent >> system (eg, the PG-XC GTM) to provide that ordering. > You're thinking of XIDs. This is a proposal for user-defined sequences. XC also provides global sequence values with the GTM. -- Michael Paquier http://michael.otacoo.com
On Mon, Oct 15, 2012 at 5:33 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > Sequences, as defined by SQL Standard, provide a series of unique > values. The current implementation on PostgreSQL isolates the > generation mechanism to only a single node, as is common on many > RDBMS. I remember constructing at least the thought process surrounding your "option #3" as the one thing I arrived at that seemed as though it might be valuable for the many-years-ago Slony-II summit. The notion of having each node give out sections of a sequence seems pretty viable; as a hugely loose analogy, DHCP servers take a somewhat similar approach in assigning IP addresses in ranges shared across those servers. At the time, back in, I think, 2005, there was some agreement that it was a viable idea to give out chunks of sequence range; it wasn't one of the tough problems warranting Heavy Thought, so there wasn't any concentration on it, and that pretty much all went by the wayside. Today, I am somewhat skeptical that there's much value to the exercise. It isn't difficult to come up with pretty unique values, between constructing something with a node ID prefix or such, or using a DCE UUID that is very likely to be globally unique. The reason to want a "global sequence" is that it's supposed to give out values in pretty much a sequential order. But a "shared global sequence" will have aspects of that that are decidedly *not* in sequential order. If it's partially *un*ordered, I suspect this undermines the value of it. There's a necessary trade-off; you can either have it globally *strongly* ordered, and, if so, you'll have to pay a hefty coordination price, or you can have the cheaper answer of a weakly ordered sequence. The latter leaves me feeling rather "meh." -- When confronted by a difficult problem, solve it by reducing it to the question, "How would the Lone Ranger handle this?"
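As a concrete illustration of the "node ID prefix" approach mentioned above, a minimal sketch; the 16/48-bit split and the function name are arbitrary assumptions:

    #include <stdint.h>

    /* Reserve the top 16 bits of a 64-bit value for a node identifier and
     * fill the low 48 bits from a purely local counter.  Values from
     * different nodes can never collide, but, as noted, they are only
     * ordered within a node, not globally. */
    static int64_t
    node_prefixed_id(uint16_t node_id, uint64_t local_counter)
    {
        return (int64_t) (((uint64_t) node_id << 48) |
                          (local_counter & ((UINT64_C(1) << 48) - 1)));
    }

Node 3 would then hand out 0x0003000000000000, 0x0003000000000001, and so on: collision-free across nodes, but ordered only within each node.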
* Christopher Browne (cbbrowne@gmail.com) wrote: > There's a necessary trade-off; you can either have it globally > *strongly* ordered, and, if so, you'll have to pay a hefty > coordination price, or you can have the cheaper answer of a weakly > ordered sequence. The latter leaves me feeling rather "meh." If all the systems involved are local to the system giving out the sequences, a la PG-XC's GTM, I don't believe it's really all *that* expensive. Thanks, Stephen
On 16 October 2012 03:03, Christopher Browne <cbbrowne@gmail.com> wrote: > There's a necessary trade-off; you can either have it globally > *strongly* ordered, and, if so, you'll have to pay a hefty > coordination price, or you can have the cheaper answer of a weakly > ordered sequence. The latter leaves me feeling rather "meh." Oracle allows you to define whether you want ORDER or not for a sequence when used in clustered mode. Requesting a sequence to be strongly ordered across a generic distributed system is very much like asking for performance=none and high_availability=off, which is why I didn't suggest it. So you're right about the "hefty coordination price" but our conclusions differ because of our understanding of that price. I don't think it makes sense to spend the time implementing that option. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2012-10-15 23:33, Simon Riggs wrote: > So, proposal is to allow nextval() allocation to access a plugin, > rather than simply write a WAL record and increment. If the plugin is > loaded all sequences call it (not OIDs). +1. It is currently impossible to alter nextval's behaviour without making changes in core. It is possible to define an alternative implementation and try to force its use via the search_path, but serial datatypes are always bound to pg_catalog.nextval(). This would enable every distributed PostgreSQL system to make a cleaner implementation for global sequences than they currently have, and would also encourage reuse of distributed nextval implementations. regards, Yeb -- Yeb Havinga http://www.mgrid.net/ Mastering Medical Data
On 10/16/2012 12:47 AM, Josh Berkus wrote: > I'd also love to hear from the PostgresXC folks on whether this solution > works for them. Postgres-R too. In Postgres-R, option 3) is implemented. Though, by default sequences work just like on a single machine, giving you monotonically increasing sequence values - independent from the node you call nextval() from. IMO that's the user's expectation. (And yes, this has a performance penalty. But no, there's no compromise in availability). It is implemented very much like the per-backend cache we already have in vanilla Postgres, but taken to the per-node level. This gives the user a nice compromise between strongly ordered and entirely random values, allowing fine-tuning the trade off between performance and laziness in the ordering (think of CACHE 10 vs. CACHE 10000). > If it works for all three of those > tools, it's liable to work for any potential new tool. In Postgres-R, this per-node cache uses additional attributes in the pg_sequence system catalog to store state of this cache. This is something I'm sure is not feasible to do from within a plugin. Why does a "Global Sequences" API necessarily hook at the nextval() and setval() level? That sounds like it yields an awkward amount of duplicate work. Reading this thread, so far it looks like we agree that option 3) is the most feasible optimization (the strict ordering being the un-optimized starting point). Do we really need an API that allows for implementations of options 1) and 2)? What I'd appreciate more is a common implementation for option 3) with an API to plug in different solutions to the underlying consensus problem. Regards Markus Wanner
On 16 October 2012 13:26, Markus Wanner <markus@bluegap.ch> wrote: > Why does a "Global Sequences" API necessarily hook at the nextval() and > setval() level? That sounds like it yields an awkward amount of > duplicate work. Reading this thread, so far it looks like we agree that > option 3) is the most feasible optimization (the strict ordering being > the un-optimized starting point). Do we really need an API that allows > for implementations of options 1) and 2)? Where else would you put the hook? The hook's location as described won't change whether you decide you want 1, 2 or 3. > What I'd appreciate more is a common implementation for option 3) with > an API to plug in different solutions to the underlying consensus problem. Implementations will be similar, differing mostly in the topology and transport layer, which means it's not going to be possible to provide such a thing initially without slowing it down to the point we don't actually get it at all. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 10/15/12 5:33 PM, Simon Riggs wrote: > There are a few options > 1) Manual separation of the value space, so that N1 has 50% of > possible values and N2 has 50%. That has problems when we reconfigure > the cluster, and requires complex manual reallocation of values. So it > starts good but ends badly. > 2) Automatic separation of the value space. This could mimic the > manual operation, so it does everything for you - but thats just > making a bad idea automatic > 3) Lazy allocation from the value space. When a node is close to > running out of values, it requests a new allocation and coordinates > with all nodes to confirm the new allocation is good. What would the allocation service look like? Is it another PostgreSQL server? What's the communication protocol? How would backups work? Crash recovery? Option 4 is of course to use UUIDs.
On 10/16/2012 08:54 AM, Peter Eisentraut wrote: > Option 4 is of course to use UUIDs. Yeah, I was wondering what this would really solve that using UUIDs wouldn't solve. cheers andrew
On Tuesday, October 16, 2012 02:58:11 PM Andrew Dunstan wrote: > On 10/16/2012 08:54 AM, Peter Eisentraut wrote: > > Option 4 is of course to use UUIDs. > > Yeah, I was wondering what this would really solve that using UUIDs > wouldn't solve. Large indexes over random values perform notably worse than mostly/completely ordered ones, as the latter can be packed nearly perfectly. Besides that, uuids have 4 times the storage overhead of an int4 and twice that of an int8. That influences both query and modification performance. Also, not allowing plainly numeric pk's makes porting an application pretty annoying... Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 16 October 2012 13:54, Peter Eisentraut <peter_e@gmx.net> wrote: > On 10/15/12 5:33 PM, Simon Riggs wrote: >> There are a few options >> 1) Manual separation of the value space, so that N1 has 50% of >> possible values and N2 has 50%. That has problems when we reconfigure >> the cluster, and requires complex manual reallocation of values. So it >> starts good but ends badly. >> 2) Automatic separation of the value space. This could mimic the >> manual operation, so it does everything for you - but thats just >> making a bad idea automatic >> 3) Lazy allocation from the value space. When a node is close to >> running out of values, it requests a new allocation and coordinates >> with all nodes to confirm the new allocation is good. > > What would the allocation service look like? Is it another PostgreSQL > server? What's the communication protocol? How would backups work? > Crash recovery? I've proposed a plugin for the allocation only. So the allocation looks like anything you want. Crash recovery and backups would not need changes. > Option 4 is of course to use UUIDs. That is a user-level option. If a user chooses sequences, then we must support them. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs <simon@2ndQuadrant.com> writes: > So, proposal is to allow nextval() allocation to access a plugin, > rather than simply write a WAL record and increment. If the plugin is > loaded all sequences call it (not OIDs). I think this is a fundamentally wrong way to go about doing what you want to do. It presumes that DDL-level manipulation of global sequences is exactly like local sequences; an assumption that is obviously false. What you really want is something vaguely like nextval but applied to a distinct type of object. That is, I think we first need a different kind of object called a "global sequence" with its own DDL operations. The nearby thread about "message queue" objects seems rather closely related. Perhaps it would be fruitful to think about the commonalities involved in two (or more?) new relkinds for global objects. regards, tom lane
On 16 October 2012 15:15, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: >> So, proposal is to allow nextval() allocation to access a plugin, >> rather than simply write a WAL record and increment. If the plugin is >> loaded all sequences call it (not OIDs). > > I think this is a fundamentally wrong way to go about doing what you > want to do. It presumes that DDL-level manipulation of global sequences > is exactly like local sequences; an assumption that is obviously false. > What you really want is something vaguely like nextval but applied to > a distinct type of object. That is, I think we first need a different > kind of object called a "global sequence" with its own DDL operations. > > The nearby thread about "message queue" objects seems rather closely > related. Perhaps it would be fruitful to think about the commonalities > involved in two (or more?) new relkinds for global objects.

The message queue concept doesn't exist at all yet, so when we create it we can specify anything we want. That is a different situation and hence a different solution. CREATE SEQUENCE is SQL Standard and used by SERIAL, many people's SQL, SQL generation tools, etc. My objective is to come up with something that makes the standard code work correctly in a replicated environment.

If rewriting the application was acceptable, we could just do as Peter suggests and say "use UUIDs". Many other people who think rewriting everything is OK spell that "CouchDB", etc. But that doesn't solve the problem at hand, which is making existing things work, rather than forcing people to rethink and rewrite. So CREATE GLOBAL SEQUENCE as a new kind of object altogether wouldn't solve the problem I'm trying to address.

I guess we could use a decoration syntax on a sequence, like this...

ALTER SEQUENCE foo GLOBAL

or

ALTER SEQUENCE foo ALLOCATION FUNCTION myglobalalloc();  -- if we cared to specify the alloc function on a per-object basis

or

ALTER SEQUENCE foo WITH (allocation=global);

So that we can explicitly specify which sequences to replicate globally and which locally. We'd need to have a userset GUC default_sequence_allocation = local (default) | global so that SERIAL and other new sequences could pick up the required behaviour when created.

In any case, we need to have a plugin/user definable allocation function for the next few years at least.

Thoughts? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs <simon@2ndQuadrant.com> writes: > On 16 October 2012 15:15, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I think this is a fundamentally wrong way to go about doing what you >> want to do. It presumes that DDL-level manipulation of global sequences >> is exactly like local sequences; an assumption that is obviously false. > The message queue concept doesn't exist at all yet, so when we create > it we can specify anything we want. That is a different situation and > hence a different solution. CREATE SEQUENCE is SQL Standard and used > by SERIAL, many people's SQL, SQL generation tools etc.. My objective > is to come up with something that makes the standard code work > correctly in a replicated environment. I challenge you to find anything in the SQL standard that suggests that sequences have any nonlocal behavior. If anything, what you propose violates the standard, it doesn't make us follow it more closely. Furthermore, I find it hard to believe that people won't want both local and global sequences in the same database --- so one way or the other we need some DDL-level reflection of the difference. A larger point though is that the various implementation choices you mentioned probably need to be configurable by DDL options. I doubt that it will work well to say "install plugin A to get behavior X, or install plugin B to get behavior Y, and whichever you choose is not further configurable, it'll be the same behavior for all sequences". So I fully expect that we're going to need something different from bog-standard CREATE SEQUENCE. Exactly what isn't clear --- but I think modifying nextval's behavior is way down the list of concerns. regards, tom lane
On 10/16/12 9:20 AM, Simon Riggs wrote: > I've proposed a plugin for the allocation only. So the allocation > looks like anything you want. Are you planning to provide a reference implementation of some kind?
On Tue, Oct 16, 2012 at 5:54 AM, Peter Eisentraut <peter_e@gmx.net> wrote: > On 10/15/12 5:33 PM, Simon Riggs wrote: >> There are a few options >> 1) Manual separation of the value space, so that N1 has 50% of >> possible values and N2 has 50%. That has problems when we reconfigure >> the cluster, and requires complex manual reallocation of values. So it >> starts good but ends badly. >> 2) Automatic separation of the value space. This could mimic the >> manual operation, so it does everything for you - but thats just >> making a bad idea automatic >> 3) Lazy allocation from the value space. When a node is close to >> running out of values, it requests a new allocation and coordinates >> with all nodes to confirm the new allocation is good. > > What would the allocation service look like? Is it another PostgreSQL > server? What's the communication protocol? How would backups work? > Crash recovery? As a reasonable proxy for the first question, one may look at how Twitter uses their home-grown software, snowflake. https://github.com/twitter/snowflake A colleague, Blake Mizerany, wrote a smaller version called "noeqd", based on the same ideas, but he wanted something with fewer dependencies. Unless you are very Java-library-and-toolchain adept you might find this more fun to play with. https://github.com/bmizerany/noeqd -- fdr
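For orientation, a sketch of the rough shape of id such services hand out; the 41/10/12-bit split below follows the commonly described snowflake layout and is not taken from either project's code:

    #include <stdint.h>

    /* Time-ordered 64-bit id: milliseconds since some custom epoch in the high
     * bits, then a worker/node id, then a per-millisecond counter.  Ids stay
     * roughly time-ordered across nodes without coordination beyond handing
     * out worker ids and keeping clocks loosely in sync. */
    static int64_t
    snowflake_style_id(uint64_t ms_since_epoch, uint16_t worker_id, uint16_t counter)
    {
        return (int64_t) (((ms_since_epoch & ((UINT64_C(1) << 41) - 1)) << 22) |
                          ((uint64_t) (worker_id & 0x3FF) << 12) |
                          (counter & 0xFFF));
    }

The point, relative to plain UUIDs, is that the ids remain roughly time-ordered and still fit in a bigint.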
On 16 October 2012 17:15, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: >> On 16 October 2012 15:15, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> I think this is a fundamentally wrong way to go about doing what you >>> want to do. It presumes that DDL-level manipulation of global sequences >>> is exactly like local sequences; an assumption that is obviously false. > >> The message queue concept doesn't exist at all yet, so when we create >> it we can specify anything we want. That is a different situation and >> hence a different solution. CREATE SEQUENCE is SQL Standard and used >> by SERIAL, many people's SQL, SQL generation tools etc.. My objective >> is to come up with something that makes the standard code work >> correctly in a replicated environment. > > I challenge you to find anything in the SQL standard that suggests that > sequences have any nonlocal behavior. No need for challenge, I agree, the SQL standard doesn't speak about that. I didn't say it did. > If anything, what you propose > violates the standard, And so that doesn't follow, but anyway... > it doesn't make us follow it more closely. I wasn't arguing that my proposal did that (made us follow standard more closely). My point is that application code exists that expects sequences to Just Work, and so the aim of the proposal was to do that in a replicated environment as well as single node. > Furthermore, I find it hard to believe that people won't want both > local and global sequences in the same database --- so one way or the > other we need some DDL-level reflection of the difference. Agreed, which is why I proposed some DDL-level syntax. Was that OK? > A larger point though is that the various implementation choices you > mentioned probably need to be configurable by DDL options. I doubt that > it will work well to say "install plugin A to get behavior X, or install > plugin B to get behavior Y, and whichever you choose is not further > configurable, it'll be the same behavior for all sequences". Again, I accept that as of my last post, and I proposed syntax to provide it. > So I fully > expect that we're going to need something different from bog-standard > CREATE SEQUENCE. There's no point in that at all, as explained. It's sequences that need to work. We can already call my_nextval() rather than nextval() if we want a roll-your own sequence facility and can rewrite applications to call that, assuming UUID isn't appropriate. Please don't force people to rewrite their applications; it might not go in the direction we want. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 16 October 2012 17:17, Peter Eisentraut <peter_e@gmx.net> wrote: > On 10/16/12 9:20 AM, Simon Riggs wrote: >> I've proposed a plugin for the allocation only. So the allocation >> looks like anything you want. > > Are you planning to provide a reference implementation of some kind? I'll provide hooks and a stub for testing. A full implementation relies upon the physical transport used. For BDR, there will be a full working version that relies upon that. TPL, PGDG. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs <simon@2ndQuadrant.com> writes: > On 16 October 2012 17:15, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> So I fully >> expect that we're going to need something different from bog-standard >> CREATE SEQUENCE. > There's no point in that at all, as explained. It's sequences that > need to work. We can already call my_nextval() rather than nextval() > if we want a roll-your own sequence facility and can rewrite > applications to call that, assuming UUID isn't appropriate. I wasn't objecting to the concept of allowing nextval() to have overloaded behaviors; more saying that that wasn't where to start the design process. In particular, the reason proposing a hook first seems backwards is that if we have a catalog-level representation that some sequences are local and others not, we should be using that to drive the determination of whether to call a substitute function --- and maybe which one to call. For instance, I could see attaching a function OID to each sequence and then having nextval() call that function, instead of a hook per se. Or maybe better, invent a level of indirection like a "sequence access method" (comparable to index access methods) that provides a compatible set of substitute functions for sequence operations. If you want to override nextval() for a sequence, don't you likely also need to override setval(), currval(), etc? Not to mention overriding ALTER SEQUENCE's behavior. regards, tom lane
Simon, On 10/16/2012 02:36 PM, Simon Riggs wrote: > Where else would you put the hook? The hook's location as described > won't change whether you decide you want 1, 2 or 3. You assume we want an API that supports all three options. In that case, yes, the hooks need to be very general. Given that option 3 got by far the most support, I question whether we need such a highly general API. I envision an API that keeps the bookkeeping and cache lookup functionality within Postgres. So we have a single, combined-effort, known working implementation for that. What remains to be done within the plugin effectively is the consensus problem: it all boils down to the question of which node gets the next chunk of N sequence numbers. Where N can be 1 (default CACHE setting in Postgres) or any higher number for better performance (reduces the total communication overhead by a factor of N - or at least pretty close to that, if you take into account "lost" chunks due to node failures). A plugin providing that has to offer a method to request a global ordering and would have to trigger a callback upon reaching consensus with other nodes on who gets the next chunk of sequence numbers. That works for all N >= 1. And properly implements option 3 (but doesn't allow implementations of options 1 or 2, which I claim we don't need, anyway). > Implementations will be similar, differing mostly in the topology and > transport layer I understand that different users have different needs WRT transport layers - moving the hooks as outlined above still allows flexibility in that regard. What different topologies do you have in mind? I'd broadly categorize this all as multi-master. Do you need finer-grained differentiation? Or do you somehow include slaves (i.e. read-only transactions) in this process? As you yourself are saying, implementations will only differ in that way, let's keep the common code the same. And not require plugins to duplicate that. (This also allows us to use the system catalogs for bookkeeping, as another benefit). > which means its not going to be possible to provide > such a thing initially without slowing it down to the point we don't > actually get it at all. Sorry, I don't quite understand what you are trying to say here. Overall, thanks for bringing this up. I'm glad to see something happening in this area, after all. Regards Markus Wanner
Tom, On 10/16/2012 06:15 PM, Tom Lane wrote: > I challenge you to find anything in the SQL standard that suggests that > sequences have any nonlocal behavior. If anything, what you propose > violates the standard, it doesn't make us follow it more closely. If you look at a distributed database as a transparent equivalent of a single-node system, I'd say the SQL standard applies to the entire distributed system. From that point of view, I'd rather argue that any "local-only" behavior violates the standard. Regards Markus Wanner
On 17 October 2012 09:10, Markus Wanner <markus@bluegap.ch> wrote: > Simon, > > On 10/16/2012 02:36 PM, Simon Riggs wrote: >> Where else would you put the hook? The hook's location as described >> won't change whether you decide you want 1, 2 or 3. > > You assume we want an API that supports all three options. In that case, > yes, the hooks need to be very general. I'm not assuming that, so much of what you say is moot, though it is good and welcome input. > Given that option 3 got by far the most support, I question whether we > need such a highly general API. I envision an API that keeps the > bookkeeping and cache lookup functionality within Postgres. So we have a > single, combined-effort, known working implementation for that. IMHO an API is required for "give me the next allocation of numbers", essentially a bulk equivalent of nextval(). Anything lower level is going to depend upon implementation details that I don't think we should expose. I'm sure there will be much commonality between 2 similar implementations, just as there is similar code in each index type. But maintaining modularity is important and ahead of us actually seeing 2 implementations, trying to prejudge that is going to slow us all down and likely screw us up. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Simon, On 10/17/2012 10:34 AM, Simon Riggs wrote: > IMHO an API is required for "give me the next allocation of numbers", > essentially a bulk equivalent of nextval(). Agreed. That pretty exactly matches what I described (and what's implemented in Postgres-R). The API then only needs to be called every N invocations of nextval(), because otherwise nextval() can simply return a cached number previously allocated in a single step, eliminating a lot of the communication overhead. You realize an API at that level doesn't allow for an implementation of options 1 and 2? (Which I'm convinced we don't need, so that's fine with me). > Anything lower level is going to depend upon implementation details > that I don't think we should expose. Exactly. Just like we shouldn't expose other implementation details, like writing to system catalogs or WAL. > I'm sure there will be much commonality between 2 similar > implementations, just as there is similar code in each index type. But > maintaining modularity is important and ahead of us actually seeing 2 > implementations, trying to prejudge that is going to slow us all down > and likely screw us up. Agreed. Let me add that modularity only serves a purpose if the boundaries between the modules are chosen wisely. It sounds like we are on the same page, though. To put that to the test: IMHO an API for setval() is required to invalidate all nodes' caches and re-set an initial value, as a starting point for the next bulk of numbers that nextval() will return. currval() doesn't need to be changed or "hooked" at all, because it's a read-only operation. Regards Markus Wanner
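To make that division of labour concrete, a loose sketch under the assumptions above: core keeps a per-node cache and only asks the plugin the consensus question of which node gets the next chunk of N values. All types and names are invented for illustration, not an existing or agreed API.

    #include <stdint.h>

    /* Chunk granted once the nodes have agreed this node owns the range. */
    typedef struct SeqChunk {
        int64_t first;
        int64_t last;
    } SeqChunk;

    /* The only plugin-side piece: block until consensus is reached and the
     * cluster grants this node the next chunk of nvalues for the sequence. */
    typedef SeqChunk (*acquire_chunk_fn)(uint32_t seq_oid, int64_t nvalues);

    /* Per-node cache kept in core; initialise with next > last so the first
     * call fetches a chunk.  nvalues plays the role of the CACHE setting
     * (think CACHE 10 vs. CACHE 10000). */
    typedef struct SeqCache {
        int64_t next;
        int64_t last;
    } SeqCache;

    static int64_t
    cached_nextval(SeqCache *cache, uint32_t seq_oid, int64_t nvalues,
                   acquire_chunk_fn acquire_chunk)
    {
        if (cache->next > cache->last)      /* cache empty or exhausted */
        {
            SeqChunk chunk = acquire_chunk(seq_oid, nvalues);

            cache->next = chunk.first;
            cache->last = chunk.last;
        }
        return cache->next++;
    }

In this picture setval() has to invalidate every node's cache and seed a fresh chunk, which is why it needs hooking too, while currval() remains a purely local read.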
Simon Riggs <simon@2ndQuadrant.com> writes: > On 16 October 2012 15:15, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> What you really want is something vaguely like nextval but applied to >> a distinct type of object. That is, I think we first need a different >> kind of object called a "global sequence" with its own DDL operations. >> > hence a different solution. CREATE SEQUENCE is SQL Standard and used > by SERIAL, many people's SQL, SQL generation tools etc.. My objective > is to come up with something that makes the standard code work > correctly in a replicated environment. I think we can still have both. I like Tom's suggestion better, as it provides for a cleaner implementation in the long run, I think. Now, the way I see to get a GLOBAL SEQUENCE by default when creating a SERIAL column would be with an Event Trigger. To get there, we need to implement some kind of "INSTEAD OF" Event Trigger, and the good news is that we only need to do that in a very specialized command, not as a generic facility. At least as a starter. CREATE EVENT TRIGGER global_sequences_please ON ddl_create_sequence WHEN context in('generated') EXECUTE PROCEDURE create_global_sequence_instead_thanks(); That would take care of any SERIAL or BIGSERIAL column creation and leave alone manual CREATE SEQUENCE commands, as those would have a context of 'toplevel' as opposed to 'generated'. This context exposure and filtering is included in my current patch for Event Triggers that I'm submitting quite soon to the next commit fest. We would still need to implement the "ddl_create_sequence" event, which fires just before CREATE SEQUENCE, and refrain from creating the sequence if an event trigger was attached, whatever that trigger did. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On 17 October 2012 11:21, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: >> On 16 October 2012 15:15, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> What you really want is something vaguely like nextval but applied to >>> a distinct type of object. That is, I think we first need a different >>> kind of object called a "global sequence" with its own DDL operations. >>> >> hence a different solution. CREATE SEQUENCE is SQL Standard and used >> by SERIAL, many people's SQL, SQL generation tools etc.. My objective >> is to come up with something that makes the standard code work >> correctly in a replicated environment. > > I think we still can have both. I like Tom's suggestion better, as it > provides for a cleaner implementation in the long run, I think. Not sure how it is cleaner when we have to have trigger stuff hanging around to make one object pretend to be another. That also creates a chain of dependency which puts this into the future, rather than now. The goal is make-sequences-work, not to invent something new that might be cooler or more useful. If we create something new, then we need to consider the references Daniel described, but that is a whole different thing and already accessible if you need/want that. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 17 October 2012 10:19, Markus Wanner <markus@bluegap.ch> wrote: > On 10/17/2012 10:34 AM, Simon Riggs wrote: >> IMHO an API is required for "give me the next allocation of numbers", >> essentially a bulk equivalent of nextval(). > > Agreed. That pretty exactly matches what I described (and what's > implemented in Postgres-R). The API then only needs to be called every N > invocations of nextval(), because otherwise nextval() can simply return > a cached number previously allocated in a single step, eliminating a lot > of the communication overhead. > > You realize an API at that level doesn't allow for an implementation of > options 1 and 2? (Which I'm convinced we don't need, so that's fine with > me). > >> Anything lower level is going to depend upon implementation details >> that I don't think we should expose. > > Exactly. Just like we shouldn't expose other implementation details, > like writing to system catalogs or WAL. > >> I'm sure there will be much commonality between 2 similar >> implementations, just as there is similar code in each index type. But >> maintaining modularity is important and ahead of us actually seeing 2 >> implementations, trying to prejudge that is going to slow us all down >> and likely screw us up. > > Agreed. Let me add, that modularity only serves a purpose, if the > boundaries between the modules are chosen wisely. It sounds like we are > on the same page, though. > > To testify this: IMHO an API for setval() is required to invalidate all > node's caches and re-set an initial value, as a starting point for the > next bulk of numbers that nextval() will return. > > currval() doesn't need to be changed or "hooked" at all, because it's a > read-only operation. Agreed.... API calls for sam_nextval_alloc() and sam_setval() using Tom's Sequence Access Method naming. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
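A sketch of what that callback table could look like, borrowing the sam_* names used in this sub-thread; the struct and exact signatures are guesses for illustration, not an agreed API:

    #include <stdint.h>

    /* One "sequence access method": a pluggable set of substitute functions
     * for sequence operations, comparable to an index access method. */
    typedef struct SequenceAccessMethod {
        /* Bulk equivalent of nextval(): allocate the next range of values
         * for the sequence, e.g. by coordinating with the other nodes. */
        void (*sam_nextval_alloc)(uint32_t seq_oid, int64_t nvalues,
                                  int64_t *first, int64_t *last);

        /* setval(): re-seed the sequence and invalidate any cached ranges,
         * on every node. */
        void (*sam_setval)(uint32_t seq_oid, int64_t new_value);
    } SequenceAccessMethod;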
On 16 October 2012 18:29, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Or maybe better, invent a level of indirection like a "sequence access > method" (comparable to index access methods) that provides a compatible > set of substitute functions for sequence operations. If you want to > override nextval() for a sequence, don't you likely also need to > override setval(), currval(), etc? Not to mention overriding ALTER > SEQUENCE's behavior. Agreed, though with the exact API as discussed in the portion of the thread with Markus. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs <simon@2ndquadrant.com> writes: > Not sure how it is cleaner when we have to have trigger stuff hanging > around to make one object pretend to be another. That also creates a > chain of dependency which puts this into the future, rather than now. Yes, that part isn't cleaner at all. The part where we have a separate Object to deal with I like better, and I tried to reconcile the two viewpoints. Note that the event trigger would come installed but disabled; the user would only have to activate it: ALTER EVENT TRIGGER distribute_my_cluster ENABLE; Still the same issue. > The goal is make-sequences-work, not to invent something new that > might be cooler or more useful. If we create something new, then we > need to consider the references Daniel described, but that is a whole > different thing and already accessible if you need/want that. So ok, I withdraw my consensus proposal. I tried. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On Tue, Oct 16, 2012 at 1:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > In particular, the reason proposing a hook first seems backwards is that > if we have a catalog-level representation that some sequences are local > and others not, we should be using that to drive the determination of > whether to call a substitute function --- and maybe which one to call. > For instance, I could see attaching a function OID to each sequence > and then having nextval() call that function, instead of a hook per se. Yeah, I like that. That makes it easy to configure your database so that some sequences have special behavior (which the database designer can set up however they like) and others can be just vanilla, and the plugin doesn't have to try to figure out which ones are which (which was my first concern in reading Simon's original proposal). To make it even better, add some generic options that can be passed through to the underlying handler. So something like: ALTER SEQUENCE wump SET HANDLER (nextval my_magical_nextval, setval my_magical_setval) OPTIONS (any_label_you_want_the_handlers_to_get 'some_text_associated_with_the_label', another_label 'some_more_text'); That way you could say, for example, that sequence wump should get its values from coordinator node 172.24.16.93 and that the global identifier for this sequence is UUID e15ea6e6-43d5-4f65-8efd-cf28a14a2d70. That way you can avoid having to make any assumptions about how local sequence names on particular nodes are mapped onto global names. > Or maybe better, invent a level of indirection like a "sequence access > method" (comparable to index access methods) that provides a compatible > set of substitute functions for sequence operations. If you want to > override nextval() for a sequence, don't you likely also need to > override setval(), currval(), etc? Not to mention overriding ALTER > SEQUENCE's behavior. This might be better, but it's also possibly more mechanism than we truly need here. But then again, if we're going to end up with more than a handful of handlers, we probably do want to do this. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, Oct 16, 2012 at 1:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Or maybe better, invent a level of indirection like a "sequence access >> method" (comparable to index access methods) that provides a compatible >> set of substitute functions for sequence operations. If you want to >> override nextval() for a sequence, don't you likely also need to >> override setval(), currval(), etc? Not to mention overriding ALTER >> SEQUENCE's behavior. > This might be better, but it's also possibly more mechanism than we > truly need here. But then again, if we're going to end up with more > than a handful of handlers, we probably do want to do this. It's definitely a lot of mechanism, and if we can get away with something simpler that's fine with me. But I'd want to see a pretty bulletproof argument why overriding *only* nextval is sufficient (and always will be) before accepting a hook for just nextval. If we build an equivalent amount of functionality piecemeal it's going to be a lot uglier than if we recognize we need this type of concept up front. regards, tom lane
On 18 October 2012 16:08, Robert Haas <robertmhaas@gmail.com> wrote: > To make > it even better, add some generic options that can be passed through to > the underlying handler. Agreed. >> Or maybe better, invent a level of indirection like a "sequence access >> method" (comparable to index access methods) that provides a compatible >> set of substitute functions for sequence operations. If you want to >> override nextval() for a sequence, don't you likely also need to >> override setval(), currval(), etc? Not to mention overriding ALTER >> SEQUENCE's behavior. > > This might be better, but it's also possibly more mechanism than we > truly need here. But then again, if we're going to end up with more > than a handful of handlers, we probably do want to do this. Let me have a play and see what comes out simplest. Somewhere in the middle seems about right. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 18 October 2012 16:15, Tom Lane <tgl@sss.pgh.pa.us> wrote: > But I'd want to see a pretty > bulletproof argument why overriding *only* nextval is sufficient > (and always will be) before accepting a hook for just nextval. If we > build an equivalent amount of functionality piecemeal it's going to > be a lot uglier than if we recognize we need this type of concept > up front. We discussed elsewhere that we need only nextval() and setval(), but adding others is fairly cheap, so we can chew on that when we have a patch to discuss. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services