Thread: small smgrcreate cleanup patch
smgrcreate() currently contains a call to TablespaceCreateDbspace(). As the comment says, this is a rather silly place to put it. The silliest thing about it, IMHO, is that it forces the following check to be done in both smgrcreate() and mdcreate(): if (isRedo && reln->md_fd[forknum] != NULL) return; So I propose moving the TablespaceCreateDbspace() call to mdcreate(), removing the redundant check from smgrcreate(), and deleting that portion of the comment which says this is in the wrong place. You could argue that perhaps md.c isn't the right place either, but it certainly makes more sense than smgr.c, and I'd argue it's exactly right. Moving the TablespaceCreateDbspace() logic into md.c (or smgr.c) seems like it would be more awkward, though I suppose it could be done. One less-than-thrilling result would be that the TablespaceCreateLock logic wouldn't be confined to tablespace.c, not that that's *necessarily* a disaster. A related, interesting question is whether there's any purpose to the smgr layer at all. Would we accept a patch that implemented an alternative smgr layer, perhaps on a per-tablespace basis? The only real alternative I can imagine to "magnetic disk" storage would be network storage. You could imagine using such a thing for temporary tablespaces, essentially writing out temporary table pages to a RAM cache on a remote machine; or perhaps more interestingly, using it for fault tolerance, keeping 2 or 3 copies of each page on different servers. Maybe someone will say "hey, just use iSCSI" or "hey, just use AFS" or something like that, but I dunno if I trust them to do the right thing with fsync, and they might not be as well-optimized as would be possible if we rolled our own. At any rate, if we DON'T think we'd ever consider supporting something like this, perhaps we ought to just fold the md.c stuff into smgr.c and eliminate all the jumping through hoops. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Attachment
Robert Haas <robertmhaas@gmail.com> writes: > So I propose moving the TablespaceCreateDbspace() call to mdcreate(), > removing the redundant check from smgrcreate(), and deleting that > portion of the comment which says this is in the wrong place. While I don't care for having smgr.c call tablespace.c, moving the call to md.c instead is surely not an improvement :-(. The problem here is that we'd like the tablespace code to be above the smgr code, not below. Calling it from md.c makes the layer inversion worse not better. > You could argue that perhaps md.c isn't the right place either, but it > certainly makes more sense than smgr.c, and I'd argue it's exactly > right. On what grounds pray tell? > A related, interesting question is whether there's any purpose to the > smgr layer at all. Not a lot anymore, I think. This has been ranted about before, but I think the Berkeley guys bet on the wrong horse when they put an API layer here. The facilities that might once have been usefully switched between at this level are now all down inside kernel device drivers. I could see merging smgr.c and md.c entirely. But they'd still be appropriately placed below, not above, the tablespace commands. regards, tom lane
On Fri, Aug 20, 2010 at 3:43 AM, Robert Haas <robertmhaas@gmail.com> wrote: > A related, interesting question is whether there's any purpose to the > smgr layer at all. Would we accept a patch that implemented an > alternative smgr layer, perhaps on a per-tablespace basis? I definitely want to keep it. I think we could usefully do an application-level raid implementation. It would be useful for people running on machines where they don't have administrative access on the machine. In particular I'm picturing shared cluster machines that run other jobs and can't be reconfigured specifically for the database. Adding per-tablespace behaviour would make the argument a lot stronger too since it's not so easy to set up different stripe sizes per table if you're using OS level raid. I also have various crazy plans to experiment with network-based storage and had intended to use smgr to do so. At google we have a bunch of different storage technologies and they're all application-level network services. You can always implement a fuse module that calls back up to a daemon which acts as the client but that doesn't make me feel any happier about it. And I know EDB has their infinicache thing using memcached -- I don't recall if it uses the smgr layer but it would certainly be a natural place to hook it in. I guess my point here is that regardless of whether we plan on accepting any such patches in core it's a very handy hook for third parties to extend postgres with. It would be nice if we had some of those modules in contrib to keep us honest with the api but even as it stands I think it's useful. -- greg
On 20/08/10 15:45, Greg Stark wrote: > On Fri, Aug 20, 2010 at 3:43 AM, Robert Haas<robertmhaas@gmail.com> wrote: >> A related, interesting question is whether there's any purpose to the >> smgr layer at all. Would we accept a patch that implemented an >> alternative smgr layer, perhaps on a per-tablespace basis? > > I definitely want to keep it. > > I think we could usefully do an application-level raid implementation. > It would be useful for people running on machines where they don't > have administrative access on the machine. In particular I'm picturing > shared cluster machines that run other jobs and can't be reconfigured > specifically for the database. Adding per-tablespace behaviour would > make the argument a lot stronger too since it's not so easy to set up > different stripe sizes per table if you're using OS level raid. > > I also have various crazy plans to experiment with network-based > storage and had intended to use smgr to do so. At google we have a > bunch of different storage technologies and they're all > application-level network services. You can always implement a fuse > module that calls back up to a daemon which acts as the client but > that doesn't make me feel any happier about it. I think you would be better off implementing that somewhere in md.c, or even file.c. The smgr layer has been dead for such a long time that it would take a significant amount of work to make it usable again. The whole infrastructure to handle fsyncs() at checkpoints, for example, is in md.c, so you'd need to rewrite all that. > And I know EDB has their infinicache thing using memcached -- I don't > recall if it uses the smgr layer but it would certainly be a natural > place to hook it in. FWIW, it doesn't. It's in bufmgr, it needs some co-operation from the buffer cache. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Thu, Aug 19, 2010 at 11:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > While I don't care for having smgr.c call tablespace.c, moving the call to > md.c instead is surely not an improvement :-(. The problem here is that > we'd like the tablespace code to be above the smgr code, not below. > Calling it from md.c makes the layer inversion worse not better. > >> You could argue that perhaps md.c isn't the right place either, but it >> certainly makes more sense than smgr.c, and I'd argue it's exactly >> right. > > On what grounds pray tell? If smgr wants to even have the pretense of being an abstraction layer, it can't very well know about the underlying file system structure. But there's no getting around the fact that md.c has to know that stuff; it has to create and write those files. There is, perhaps, a layer inversion problem here, but if anything I think it's that the functionality of tablespace.c spans everything from the SQL layer all the way down to pushing bits in the filesystem. But that's not really the fault of smgr.c/md.c. Perhaps tablespace.c shouldn't assume anything about the underlying filesystem representation and that knowledge should be moved somewhere under src/backend/storage, but I don't see how it makes sense for the smgr layer to include assumptions about what filesystem abstraction md.c happens to implement. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Fri, Aug 20, 2010 at 8:45 AM, Greg Stark <gsstark@mit.edu> wrote: > On Fri, Aug 20, 2010 at 3:43 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> A related, interesting question is whether there's any purpose to the >> smgr layer at all. Would we accept a patch that implemented an >> alternative smgr layer, perhaps on a per-tablespace basis? > > I definitely want to keep it. > > I think we could usefully do an application-level raid implementation. > It would be useful for people running on machines where they don't > have administrative access on the machine. In particular I'm picturing > shared cluster machines that run other jobs and can't be reconfigured > specifically for the database. Adding per-tablespace behaviour would > make the argument a lot stronger too since it's not so easy to set up > different stripe sizes per table if you're using OS level raid. That would actually be kind of cool. > I also have various crazy plans to experiment with network-based > storage and had intended to use smgr to do so. At google we have a > bunch of different storage technologies and they're all > application-level network services. You can always implement a fuse > module that calls back up to a daemon which acts as the client but > that doesn't make me feel any happier about it. Me either. I really like the idea of trying to use network-based storage in some way. Gigabit Ethernet is a big I/O channel. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Robert Haas <robertmhaas@gmail.com> writes: > ... Perhaps tablespace.c shouldn't assume > anything about the underlying filesystem representation and that > knowledge should be moved somewhere under src/backend/storage, but I > don't see how it makes sense for the smgr layer to include assumptions > about what filesystem abstraction md.c happens to implement. Well, the other approach we could take is to move the tablespace.c filesystem-whacking code into md.c, expose it via a new smgr API, and have commands/tablespace.c call that. I wouldn't have a layering problem with a design like that, and as you say it's probably cleaner than what's there. But having something in smgr calling something in /commands is Just Wrong. regards, tom lane
On 20/08/10 16:30, Robert Haas wrote: > I really like the idea of trying to use network-based storage in some > way. Gigabit Ethernet is a big I/O channel. NFS? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Fri, Aug 20, 2010 at 9:51 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > On 20/08/10 16:30, Robert Haas wrote: >> >> I really like the idea of trying to use network-based storage in some >> way. Gigabit Ethernet is a big I/O channel. > > NFS? I don't particularly trust NFS to be either reliable or performant for database use. Do you? And what if you want additional functionality, like sharding or mirroring? ISTM that something built around a custom protocol that mimics exactly what we need from the smgr layer would be a lot easier to set up and a lot easier to be confident in. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On 20/08/10 17:01, Robert Haas wrote: > On Fri, Aug 20, 2010 at 9:51 AM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> On 20/08/10 16:30, Robert Haas wrote: >>> >>> I really like the idea of trying to use network-based storage in some >>> way. Gigabit Ethernet is a big I/O channel. >> >> NFS? > > I don't particularly trust NFS to be either reliable or performant for > database use. Do you? Depends on the implementation, I guess, but the point is that there's a bazillion network-based filesystems with different tradeoffs out there already. It seems unlikely that you could outperform them with something built into PostgreSQL. To put it other way, if you built network-based storage into PostgreSQL, what PostgreSQL-specific knowledge could you take advanage of to make it more performant/reliable? If there isn't any, I don't see the point. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Fri, Aug 20, 2010 at 10:20 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > On 20/08/10 17:01, Robert Haas wrote: >> >> On Fri, Aug 20, 2010 at 9:51 AM, Heikki Linnakangas >> <heikki.linnakangas@enterprisedb.com> wrote: >>> >>> On 20/08/10 16:30, Robert Haas wrote: >>>> >>>> I really like the idea of trying to use network-based storage in some >>>> way. Gigabit Ethernet is a big I/O channel. >>> >>> NFS? >> >> I don't particularly trust NFS to be either reliable or performant for >> database use. Do you? > > Depends on the implementation, I guess, but the point is that there's a > bazillion network-based filesystems with different tradeoffs out there > already. It seems unlikely that you could outperform them with something > built into PostgreSQL. > > To put it other way, if you built network-based storage into PostgreSQL, > what PostgreSQL-specific knowledge could you take advanage of to make it > more performant/reliable? If there isn't any, I don't see the point. PostgreSQL-specific knowledge? Probably not. But: - Setting up NFS is very easy to do wrong. I bet if you find 100 people who are running PG over NFS, 80 of them have a wrong setting somewhere that's compromising their data integrity. - NFS, like all other solutions in this area, is platform-specific, and thus not available everywhere. - We don't need a general-purpose network file system - we need something very specific, which should therefore be able to be done in a more lightweight fashion. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Excerpts from Tom Lane's message of jue ago 19 23:35:06 -0400 2010: > Robert Haas <robertmhaas@gmail.com> writes: > > So I propose moving the TablespaceCreateDbspace() call to mdcreate(), > > removing the redundant check from smgrcreate(), and deleting that > > portion of the comment which says this is in the wrong place. > > While I don't care for having smgr.c call tablespace.c, moving the call to > md.c instead is surely not an improvement :-(. The problem here is that > we'd like the tablespace code to be above the smgr code, not below. > Calling it from md.c makes the layer inversion worse not better. I proposed moving that code to storage.c some time ago but was shot down for reasons I don't remember, and didn't press further. Perhaps the idea of moving it all to smgr.c/md.c for tablespace.c to call makes more sense; but if that doesn't happen, I still think that storage.c is a better place. -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes: > Excerpts from Tom Lane's message of jue ago 19 23:35:06 -0400 2010: >> While I don't care for having smgr.c call tablespace.c, moving the call to >> md.c instead is surely not an improvement :-(. The problem here is that >> we'd like the tablespace code to be above the smgr code, not below. >> Calling it from md.c makes the layer inversion worse not better. > I proposed moving that code to storage.c some time ago but was shot down > for reasons I don't remember, and didn't press further. Perhaps the > idea of moving it all to smgr.c/md.c for tablespace.c to call makes more > sense; but if that doesn't happen, I still think that storage.c is a > better place. Hmm. I've never been terribly happy with storage.c: it doesn't have any clear place in the layering, and looks like it's mostly code that has been ripped out of smgr.c for no very defensible reason, and then moved into catalog/ for even less reason. Maybe we should back up and rethink the organization of all of smgr.c/md.c/storage.c. The real underlying problem here is that there's so much stuff in commands/ that is freely violating the smgr abstraction level by poking at the filesystem directly. It's impossible to have a nice layering, or even what you could call an abstraction, as long as things stay like that. I guess it got that way because smgr.c was so useless that nobody bothered to factor it in when thinking about how to implement tablespaces and suchlike. But if we're going to do anything at all in the way of cleaning up these interfaces, we need to start with a redesign that re-establishes some clear division of labor. regards, tom lane