It's past time to redo the smgr API - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | It's past time to redo the smgr API |
Date | |
Msg-id | 5416.1076007946@sss.pgh.pa.us Whole thread Raw |
Responses |
Re: It's past time to redo the smgr API
Re: It's past time to redo the smgr API Re: It's past time to redo the smgr API |
List | pgsql-hackers |
A long time ago Vadim proposed that we should revise smgr.c's API so that it does not depend on Relations (relcache entries); rather, only a RelFileNode value should be needed to access a file in smgr and lower levels. This would allow us to get rid of the concept of "blind writes". As of CVS tip, with most writes being done by the bgwriter, most writes are blind. Try tracing the bgwriter: every write involves a file open and close, which has got to be hurting its performance. The same has been true for awhile now of the background checkpointer. I'm motivated to do something about this today because I'm reviewing the PITR patch that J.R. Nield and Patrick Macdonald were working on, and one of the things I don't care for is the kluges it adds to the smgr API to work around lack of a Relation in some operations. I propose the following revisions: * smgr.c should internally keep a hashtable (indexed by RelFileNode) with "struct SMgrRelation" entries for each relation that has recently been accessed. File access information for the lower-level modules (md.c, fd.c) would appear in this hashtable entry. * smgropen takes a RelFileNode and returns a pointer to a hashtable entry, which is used as the argument to all other smgr operations. smgropen itself does not cause any actual I/O; thus it doesn't matter if the file doesn't exist yet. smgropen followed by smgrcreate is the way to create a new relation on disk. Without smgrcreate, file existence will be checked at the first read or write attempt. * smgrclose closes the md.c-level file and drops the hashtable entry. Hashtable entries remain valid unless explicitly closed (thus, multiple smgropens for the same file are legal). * The rd_fd field of relcache nodes will be replaced by an "SMgrRelation *" pointer, so that callers having a Relation reference can access the smgr file easily. (In particular, this means that most calls to the bufmgr will still pass Relations, and we don't need to propagate the API change up to the bufmgr level.) * The existing places in the bufmgr that do RelationNodeCacheGetRelation followed by either regular or blind write would be replaced by smgropen followed by smgrwrite. Since smgropen is just a hash table lookup, it is no more expensive than the RelationNodeCacheGetRelation it replaces. Note these places would *not* do smgrclose afterwards. * Because we don't smgrclose after a write, it is possible to have "dangling" smgr entries that aren't useful any more, as well as open file descriptors underneath them. This isn't too big a deal on Unix but it will be fatal for the Windows port, since it would prevent a DROP TABLE if some other backend happens to have touched the table. What I propose to do about this is: 1. In the bgwriter, at each checkpoint do "smgrcloseall" to close all open files.2. In regular backends, receipt of a relcache flush message will result in smgrclose(), even if there is not a relcacheentry for the given relation ID. A global cache flush event (sinval buffer overflow) causes smgrcloseall. This ensures that open descriptors for a dead relation will not persist past the next checkpoint. We had already agreed, I think, that the best solution for Windows' problem with deleting open files is to retry pending deletes at checkpoint. This smgr rewrite will not contribute to actually making that happen, but it won't make things worse either. * I'm going to get rid of the "which" argument that once selected a particular smgr implementation (md.c vs mm.c). It's vestigial at present and is misdefined anyway. If we were to resurrect the concept of multiple storage managers, we'd need the selection to be determinable from the RelFileNode value, rather than being a separate argument. (When we add tablespaces, we could imagine associating smgrs with tablespaces, but it would have to be the responsibility of smgr.c to determine which smgr to use based on the tablespace ID.) * We can remove the "dummy cache" support in relcache.c, as well as RelationNodeCacheGetRelation() and the hashtable that supports it. * The smgr hashtable entries could possibly be used for recording other status; for instance keeping track of which relations need fsync in the bgwriter. I'm also thinking of merging smgr.c's existing list-of-rels-to-be-deleted into this data structure. * AFAICS the only downside of not having a Relation available in smgr.c and md.c is that error messages could only refer to the RelFileNode numbers and not to the relation name. I'm not sure this is bad, since in my experience what you want to know about such errors is the actual disk filename, which RelFileNode tells you and relation name doesn't. We could preserve the current behavior by passing the relation name to smgropen when available, and saving the name in struct SMgrRelation. But I'm inclined not to. Comments? regards, tom lane
pgsql-hackers by date: