Thread: small smgrcreate cleanup patch

small smgrcreate cleanup patch

From

Robert Haas

Date:

19 August 2010, 23:44:07

smgrcreate() currently contains a call to TablespaceCreateDbspace().
As the comment says, this is a rather silly place to put it.  The
silliest thing about it, IMHO, is that it forces the following check
to be done in both smgrcreate() and mdcreate():

      if (isRedo && reln->md_fd[forknum] != NULL)
              return;

So I propose moving the TablespaceCreateDbspace() call to mdcreate(),
removing the redundant check from smgrcreate(), and deleting that
portion of the comment which says this is in the wrong place.  You
could argue that perhaps md.c isn't the right place either, but it
certainly makes more sense than smgr.c, and I'd argue it's exactly
right.  Moving the TablespaceCreateDbspace() logic into md.c (or
smgr.c) seems like it would be more awkward, though I suppose it could
be done.  One less-than-thrilling result would be that the
TablespaceCreateLock logic wouldn't be confined to tablespace.c, not
that that's *necessarily* a disaster.

A related, interesting question is whether there's any purpose to the
smgr layer at all.  Would we accept a patch that implemented an
alternative smgr layer, perhaps on a per-tablespace basis?  The only
real alternative I can imagine to "magnetic disk" storage would be
network storage.  You could imagine using such a thing for temporary
tablespaces, essentially writing out temporary table pages to a RAM
cache on a remote machine; or perhaps more interestingly, using it for
fault tolerance, keeping 2 or 3 copies of each page on different
servers.  Maybe someone will say "hey, just use iSCSI" or "hey, just
use AFS" or something like that, but I dunno if I trust them to do the
right thing with fsync, and they might not be as well-optimized as
would be possible if we rolled our own.  At any rate, if we DON'T
think we'd ever consider supporting something like this, perhaps we
ought to just fold the md.c stuff into smgr.c and eliminate all the
jumping through hoops.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Attachment

smgrcreate_cleanup.patch

Re: small smgrcreate cleanup patch

From

Tom Lane

Date:

20 August 2010, 00:35:16

Robert Haas <robertmhaas@gmail.com> writes:
> So I propose moving the TablespaceCreateDbspace() call to mdcreate(),
> removing the redundant check from smgrcreate(), and deleting that
> portion of the comment which says this is in the wrong place.

While I don't care for having smgr.c call tablespace.c, moving the call to
md.c instead is surely not an improvement :-(.  The problem here is that
we'd like the tablespace code to be above the smgr code, not below.
Calling it from md.c makes the layer inversion worse not better.

> You could argue that perhaps md.c isn't the right place either, but it
> certainly makes more sense than smgr.c, and I'd argue it's exactly
> right.

On what grounds pray tell?

> A related, interesting question is whether there's any purpose to the
> smgr layer at all.

Not a lot anymore, I think.  This has been ranted about before, but
I think the Berkeley guys bet on the wrong horse when they put an API
layer here.  The facilities that might once have been usefully switched
between at this level are now all down inside kernel device drivers.
I could see merging smgr.c and md.c entirely.  But they'd still be
appropriately placed below, not above, the tablespace commands.
        regards, tom lane

Re: small smgrcreate cleanup patch

From

Greg Stark

Date:

20 August 2010, 09:45:38

On Fri, Aug 20, 2010 at 3:43 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> A related, interesting question is whether there's any purpose to the
> smgr layer at all.  Would we accept a patch that implemented an
> alternative smgr layer, perhaps on a per-tablespace basis?

I definitely want to keep it.

I think we could usefully do an application-level raid implementation.
It would be useful for people running on machines where they don't
have administrative access on the machine. In particular I'm picturing
shared cluster machines that run other jobs and can't be reconfigured
specifically for the database. Adding per-tablespace behaviour would
make the argument a lot stronger too since it's not so easy to set up
different stripe sizes per table if you're using OS level raid.

I also have various crazy plans to experiment with network-based
storage and had intended to use smgr to do so. At google we have a
bunch of different storage technologies and they're all
application-level network services. You can always implement a fuse
module that calls back up to a daemon which acts as the client but
that doesn't make me feel any happier about it.

And I know EDB has their infinicache thing using memcached -- I don't
recall if it uses the smgr layer but it would certainly be a natural
place to hook it in.

I guess my point here is that regardless of whether we plan on
accepting any such patches in core it's a very handy hook for third
parties to extend postgres with. It would be nice if we had some of
those modules in contrib to keep us honest with the api but even as it
stands I think it's useful.

--
greg

Re: small smgrcreate cleanup patch

From

Heikki Linnakangas

Date:

20 August 2010, 09:54:54

On 20/08/10 15:45, Greg Stark wrote:
> On Fri, Aug 20, 2010 at 3:43 AM, Robert Haas<robertmhaas@gmail.com>  wrote:
>> A related, interesting question is whether there's any purpose to the
>> smgr layer at all.  Would we accept a patch that implemented an
>> alternative smgr layer, perhaps on a per-tablespace basis?
>
> I definitely want to keep it.
>
> I think we could usefully do an application-level raid implementation.
> It would be useful for people running on machines where they don't
> have administrative access on the machine. In particular I'm picturing
> shared cluster machines that run other jobs and can't be reconfigured
> specifically for the database. Adding per-tablespace behaviour would
> make the argument a lot stronger too since it's not so easy to set up
> different stripe sizes per table if you're using OS level raid.
>
> I also have various crazy plans to experiment with network-based
> storage and had intended to use smgr to do so. At google we have a
> bunch of different storage technologies and they're all
> application-level network services. You can always implement a fuse
> module that calls back up to a daemon which acts as the client but
> that doesn't make me feel any happier about it.

I think you would be better off implementing that somewhere in md.c, or 
even file.c. The smgr layer has been dead for such a long time that it 
would take a significant amount of work to make it usable again. The 
whole infrastructure to handle fsyncs() at checkpoints, for example, is 
in md.c, so you'd need to rewrite all that.

> And I know EDB has their infinicache thing using memcached -- I don't
> recall if it uses the smgr layer but it would certainly be a natural
> place to hook it in.

FWIW, it doesn't. It's in bufmgr, it needs some co-operation from the 
buffer cache.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com

Re: small smgrcreate cleanup patch

From

Robert Haas

Date:

20 August 2010, 10:27:58

On Thu, Aug 19, 2010 at 11:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> While I don't care for having smgr.c call tablespace.c, moving the call to
> md.c instead is surely not an improvement :-(.  The problem here is that
> we'd like the tablespace code to be above the smgr code, not below.
> Calling it from md.c makes the layer inversion worse not better.
>
>> You could argue that perhaps md.c isn't the right place either, but it
>> certainly makes more sense than smgr.c, and I'd argue it's exactly
>> right.
>
> On what grounds pray tell?

If smgr wants to even have the pretense of being an abstraction layer,
it can't very well know about the underlying file system structure.
But there's no getting around the fact that md.c has to know that
stuff; it has to create and write those files.  There is, perhaps, a
layer inversion problem here, but if anything I think it's that the
functionality of tablespace.c spans everything from the SQL layer all
the way down to pushing bits in the filesystem.  But that's not really
the fault of smgr.c/md.c.  Perhaps tablespace.c shouldn't assume
anything about the underlying filesystem representation and that
knowledge should be moved somewhere under src/backend/storage, but I
don't see how it makes sense for the smgr layer to include assumptions
about what filesystem abstraction md.c happens to implement.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Re: small smgrcreate cleanup patch

From

Robert Haas

Date:

20 August 2010, 10:30:15

On Fri, Aug 20, 2010 at 8:45 AM, Greg Stark <gsstark@mit.edu> wrote:
> On Fri, Aug 20, 2010 at 3:43 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> A related, interesting question is whether there's any purpose to the
>> smgr layer at all.  Would we accept a patch that implemented an
>> alternative smgr layer, perhaps on a per-tablespace basis?
>
> I definitely want to keep it.
>
> I think we could usefully do an application-level raid implementation.
> It would be useful for people running on machines where they don't
> have administrative access on the machine. In particular I'm picturing
> shared cluster machines that run other jobs and can't be reconfigured
> specifically for the database. Adding per-tablespace behaviour would
> make the argument a lot stronger too since it's not so easy to set up
> different stripe sizes per table if you're using OS level raid.

That would actually be kind of cool.

> I also have various crazy plans to experiment with network-based
> storage and had intended to use smgr to do so. At google we have a
> bunch of different storage technologies and they're all
> application-level network services. You can always implement a fuse
> module that calls back up to a daemon which acts as the client but
> that doesn't make me feel any happier about it.

Me either.

I really like the idea of trying to use network-based storage in some
way.  Gigabit Ethernet is a big I/O channel.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Re: small smgrcreate cleanup patch

From

Tom Lane

Date:

20 August 2010, 10:36:13

Robert Haas <robertmhaas@gmail.com> writes:
> ...  Perhaps tablespace.c shouldn't assume
> anything about the underlying filesystem representation and that
> knowledge should be moved somewhere under src/backend/storage, but I
> don't see how it makes sense for the smgr layer to include assumptions
> about what filesystem abstraction md.c happens to implement.

Well, the other approach we could take is to move the tablespace.c
filesystem-whacking code into md.c, expose it via a new smgr API, and
have commands/tablespace.c call that.  I wouldn't have a layering
problem with a design like that, and as you say it's probably cleaner
than what's there.  But having something in smgr calling something in
/commands is Just Wrong.
        regards, tom lane

Re: small smgrcreate cleanup patch

From

Heikki Linnakangas

Date:

20 August 2010, 10:51:16

On 20/08/10 16:30, Robert Haas wrote:
> I really like the idea of trying to use network-based storage in some
> way.  Gigabit Ethernet is a big I/O channel.

NFS?

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com

Re: small smgrcreate cleanup patch

From

Robert Haas

Date:

20 August 2010, 11:02:06

On Fri, Aug 20, 2010 at 9:51 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> On 20/08/10 16:30, Robert Haas wrote:
>>
>> I really like the idea of trying to use network-based storage in some
>> way.  Gigabit Ethernet is a big I/O channel.
>
> NFS?

I don't particularly trust NFS to be either reliable or performant for
database use.  Do you?  And what if you want additional functionality,
like sharding or mirroring?  ISTM that something built around a custom
protocol that mimics exactly what we need from the smgr layer would be
a lot easier to set up and a lot easier to be confident in.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Re: small smgrcreate cleanup patch

From

Heikki Linnakangas

Date:

20 August 2010, 11:20:38

On 20/08/10 17:01, Robert Haas wrote:
> On Fri, Aug 20, 2010 at 9:51 AM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com>  wrote:
>> On 20/08/10 16:30, Robert Haas wrote:
>>>
>>> I really like the idea of trying to use network-based storage in some
>>> way.  Gigabit Ethernet is a big I/O channel.
>>
>> NFS?
>
> I don't particularly trust NFS to be either reliable or performant for
> database use.  Do you?

Depends on the implementation, I guess, but the point is that there's a 
bazillion network-based filesystems with different tradeoffs out there 
already. It seems unlikely that you could outperform them with something 
built into PostgreSQL.

To put it other way, if you built network-based storage into PostgreSQL, 
what PostgreSQL-specific knowledge could you take advanage of to make it 
more performant/reliable? If there isn't any, I don't see the point.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com

Re: small smgrcreate cleanup patch

From

Robert Haas

Date:

20 August 2010, 11:30:55

On Fri, Aug 20, 2010 at 10:20 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> On 20/08/10 17:01, Robert Haas wrote:
>>
>> On Fri, Aug 20, 2010 at 9:51 AM, Heikki Linnakangas
>> <heikki.linnakangas@enterprisedb.com>  wrote:
>>>
>>> On 20/08/10 16:30, Robert Haas wrote:
>>>>
>>>> I really like the idea of trying to use network-based storage in some
>>>> way.  Gigabit Ethernet is a big I/O channel.
>>>
>>> NFS?
>>
>> I don't particularly trust NFS to be either reliable or performant for
>> database use.  Do you?
>
> Depends on the implementation, I guess, but the point is that there's a
> bazillion network-based filesystems with different tradeoffs out there
> already. It seems unlikely that you could outperform them with something
> built into PostgreSQL.
>
> To put it other way, if you built network-based storage into PostgreSQL,
> what PostgreSQL-specific knowledge could you take advanage of to make it
> more performant/reliable? If there isn't any, I don't see the point.

PostgreSQL-specific knowledge?  Probably not.  But:

- Setting up NFS is very easy to do wrong.  I bet if you find 100
people who are running PG over NFS, 80 of them have a wrong setting
somewhere that's compromising their data integrity.
- NFS, like all other solutions in this area, is platform-specific,
and thus not available everywhere.
- We don't need a general-purpose network file system - we need
something very specific, which should therefore be able to be done in
a more lightweight fashion.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Re: small smgrcreate cleanup patch

From

Alvaro Herrera

Date:

20 August 2010, 14:27:26

Excerpts from Tom Lane's message of jue ago 19 23:35:06 -0400 2010:
> Robert Haas <robertmhaas@gmail.com> writes:
> > So I propose moving the TablespaceCreateDbspace() call to mdcreate(),
> > removing the redundant check from smgrcreate(), and deleting that
> > portion of the comment which says this is in the wrong place.
> 
> While I don't care for having smgr.c call tablespace.c, moving the call to
> md.c instead is surely not an improvement :-(.  The problem here is that
> we'd like the tablespace code to be above the smgr code, not below.
> Calling it from md.c makes the layer inversion worse not better.

I proposed moving that code to storage.c some time ago but was shot down
for reasons I don't remember, and didn't press further.  Perhaps the
idea of moving it all to smgr.c/md.c for tablespace.c to call makes more
sense; but if that doesn't happen, I still think that storage.c is a
better place.

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: small smgrcreate cleanup patch

From

Tom Lane

Date:

20 August 2010, 14:37:51

Alvaro Herrera <alvherre@commandprompt.com> writes:
> Excerpts from Tom Lane's message of jue ago 19 23:35:06 -0400 2010:
>> While I don't care for having smgr.c call tablespace.c, moving the call to
>> md.c instead is surely not an improvement :-(.  The problem here is that
>> we'd like the tablespace code to be above the smgr code, not below.
>> Calling it from md.c makes the layer inversion worse not better.

> I proposed moving that code to storage.c some time ago but was shot down
> for reasons I don't remember, and didn't press further.  Perhaps the
> idea of moving it all to smgr.c/md.c for tablespace.c to call makes more
> sense; but if that doesn't happen, I still think that storage.c is a
> better place.

Hmm.  I've never been terribly happy with storage.c: it doesn't have any
clear place in the layering, and looks like it's mostly code that has
been ripped out of smgr.c for no very defensible reason, and then moved
into catalog/ for even less reason.  Maybe we should back up and rethink
the organization of all of smgr.c/md.c/storage.c.

The real underlying problem here is that there's so much stuff in
commands/ that is freely violating the smgr abstraction level by poking
at the filesystem directly.  It's impossible to have a nice layering,
or even what you could call an abstraction, as long as things stay like
that.  I guess it got that way because smgr.c was so useless that nobody
bothered to factor it in when thinking about how to implement
tablespaces and suchlike.  But if we're going to do anything at all in
the way of cleaning up these interfaces, we need to start with a
redesign that re-establishes some clear division of labor.
        regards, tom lane