dynamic shared memory - Mailing list pgsql-hackers
From: Robert Haas
Subject: dynamic shared memory
Msg-id: CA+TgmoaDqDUgt=4Zs_QPOnBt=EstEaVNP+5t+m=FPNWshiPR3A@mail.gmail.com
List: pgsql-hackers

Please find attached a first version of a patch to allow additional "dynamic" shared memory segments; that is, shared memory segments that are created after server startup, live for a period of time, and are then destroyed when no longer needed. The main purpose of this patch is to facilitate parallel query: if we've got multiple backends working on the same query, they're going to need a way to communicate. Doing that through the main shared memory segment seems infeasible, because we could, for some applications, need to share very large amounts of data.

For example, for internal sort, we basically load the data to be sorted into memory and then rearrange an array of pointers to the items being sorted. For parallel internal sort, we might want to do much the same thing, but with different backend processes manipulating different parts of the array. I'm not exactly sure how that's going to work out in detail yet, but it seems fair to say that the amount of data we want to share between processes there could be quite a bit larger than anything we'd feel comfortable nailing down in the permanent shared memory segment. Other cases, like parallel sequential scan, might require much smaller buffers, since there might not be much point in letting the scan get too far ahead if nothing's consuming the tuples it produces. With this infrastructure, we can choose at run time exactly how much memory to allocate for a particular purpose and return it to the operating system as soon as we're done with it.

Creating a shared memory segment is a somewhat operating-system-dependent task. I decided that it would be smart to support several different implementations and to let the user choose which one they'd like to use via a new GUC, dynamic_shared_memory_type. Since we currently require System V shared memory to be supported on all platforms other than Windows, I have included a System V implementation (shmget, shmctl, shmat, shmdt). However, as we know, on many systems System V shared memory limits are low out of the box, and raising them is an annoyance for users. Therefore, I've included an implementation based on the POSIX shared memory facilities (shm_open, shm_unlink), which is the default on systems that support them (some of the BSDs do not, I believe). We will also need a Windows implementation, which I have not attempted, but one of my colleagues at EnterpriseDB will be filling in that gap.

In addition, I've included an implementation based on mmap of a plain file. As compared with a true shared memory implementation, this obviously has the disadvantage that the OS may be more likely to decide to write dirty pages back to disk, which could hurt performance. However, I believe it's worthy of inclusion all the same, because there are a variety of situations in which it might be more convenient than the other implementations. One is debugging. On Mac OS X, for example, there seems to be no way to list POSIX shared memory segments, and no easy way to inspect the contents of either POSIX or System V shared memory segments. Another use case is working around an administrator-imposed or OS-imposed shared memory limit. If you're not allowed to allocate shared memory, but you are allowed to create files, then this implementation will let you use whatever facilities we build on top of dynamic shared memory anyway.

A third possible reason to use this implementation is compartmentalization. For example, you can put the directory that stores the dynamic shared memory segments on a RAM disk - which removes the performance concern - and then do whatever you like with that directory: secure it, put filesystem quotas on it, or sprinkle magic pixie dust on it. It doesn't even seem out of the question that there might be cases where multiple RAM disks with different performance characteristics are present (e.g. on NUMA machines), and this would provide fine-grained control over where your shared memory segments get placed. To make a long story short, I won't be crushed if the consensus is against including this implementation, but I think it's useful.

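To make the OS-level differences concrete, here is a minimal, standalone sketch of what the POSIX flavor boils down to. This is not code from the patch: the segment name and size are invented and the error handling is perfunctory. The System V flavor does the equivalent dance with shmget/shmat/shmdt/shmctl, the file-backed flavor open()s and mmap()s a plain file in the directory that stores the segments, and the dynamic_shared_memory_type GUC simply selects which family gets used.

    /* Create, map, and eventually remove a POSIX shared memory segment. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int
    main(void)
    {
        const char *name = "/dsm_sketch";   /* hypothetical segment name */
        size_t      size = 1024 * 1024;     /* size chosen at run time */
        int         fd;
        void       *addr;

        /* Create the segment and give it a size. */
        fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0600);
        if (fd < 0 || ftruncate(fd, size) < 0)
        {
            perror("create");
            return 1;
        }

        /*
         * Map it.  Any other process that learns the name can shm_open()
         * and mmap() the same segment and see the same bytes.
         */
        addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (addr == MAP_FAILED)
        {
            perror("mmap");
            return 1;
        }
        strcpy(addr, "hello from the creating process");

        /* Tear down: unmap, close, and unlink so the name disappears. */
        munmap(addr, size);
        close(fd);
        shm_unlink(name);
        return 0;
    }
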
Other implementations are imaginable but not implemented here. For example, you can imagine using mmap() of an anonymous file. However, since the point is that these segments are created on the fly by individual backends and then shared with other backends, that gets a little tricky. In order for the second backend to map the same anonymous shared memory segment that the first one mapped, you'd have to pass the file descriptor from one process to the other. There are ways, on most if not all platforms, to pass file descriptors through sockets, but there's not automatically a socket connection between the two processes either, so it gets hairy to think about making that work. I did, however, include a "none" implementation, which has the effect of shutting the facility off altogether.

The actual implementation is split into two layers. dsm_impl.c/h encapsulate the implementation-dependent functionality at a very raw level, while dsm.c/h wrap that functionality in a more palatable API. Most of that wrapper layer is concerned with just one problem: avoiding leaks. This turned out to require multiple levels of safeguards, which I duly implemented.

First, dynamic shared memory segments need to be reference-counted, so that when the last mapping is removed, the segment automatically goes away. (We could allow for server-lifespan segments as well with only trivial changes, but I'm not sure whether there are compelling use cases for that.) If a backend is terminated uncleanly, the postmaster needs to remove all leftover segments during the crash-and-restart process, just as it needs to reinitialize the main shared memory segment. And if all processes are terminated uncleanly, the next postmaster startup needs to clean up any segments that still exist, again just as we already do for the main shared memory segment. Neither POSIX shared memory nor System V shared memory provides an API for enumerating all existing shared memory segments, so we must keep track ourselves of what we have outstanding.

Second, we need to ensure, within the scope of an individual process, that we retain a mapping only for as long as necessary. Just as memory contexts, locks, buffer pins, and other resources automatically go away at the end of a query or (sub)transaction, dynamic shared memory mappings created for a purpose such as parallel sort need to go away if we abort midway through. Of course, if you have a user backend coordinating with workers, it seems pretty likely that the workers are just going to exit if they hit an error, so having the mapping be process-lifetime wouldn't necessarily be a big deal; but the user backend may stick around for a long time and execute other queries, and we can't afford to have it accumulate mappings, not least because that's equivalent to a session-lifespan memory leak.

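Before getting to how those leaks are actually prevented, here is a rough sketch of the overall shape of the two layers just described. The identifier names below are my own shorthand for illustration; they are not necessarily what the patch calls things.

    #include <stddef.h>
    #include <stdint.h>

    /*
     * dsm_impl.c/h: the raw, implementation-dependent layer, selected by
     * the dynamic_shared_memory_type GUC.
     */
    typedef enum
    {
        DSM_IMPL_NONE,      /* facility disabled */
        DSM_IMPL_SYSV,      /* shmget/shmat/shmdt/shmctl */
        DSM_IMPL_POSIX,     /* shm_open/shm_unlink */
        DSM_IMPL_MMAP,      /* plain file, mmap'd */
        DSM_IMPL_WINDOWS    /* to be filled in separately */
    } dsm_impl_kind;

    /*
     * dsm.c/h: the palatable wrapper, which adds reference counting and
     * the cleanup hooks described above.
     */
    typedef struct dsm_segment dsm_segment;    /* opaque to callers */
    typedef uint32_t dsm_handle;                /* how other backends name a
                                                 * segment they want to map */

    extern dsm_segment *dsm_create(size_t size);
    extern dsm_segment *dsm_attach(dsm_handle handle);
    extern void *dsm_segment_address(dsm_segment *seg);
    extern dsm_handle dsm_segment_handle(dsm_segment *seg);
    extern void dsm_detach(dsm_segment *seg);       /* drop the refcount; the
                                                     * last detach destroys
                                                     * the segment */
    extern void dsm_keep_mapping(dsm_segment *seg); /* keep the mapping for
                                                     * the whole session
                                                     * rather than the current
                                                     * resource owner */
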
To help solve these problems, I invented something called the "dynamic shared memory control segment". This is a dynamic shared memory segment created at startup (or reinitialization) time by the postmaster, before any user processes are created. It is used to store a list of the identities of all the other dynamic shared memory segments we have outstanding, along with the reference count of each. If the postmaster goes through a crash-and-reset cycle, it scans the control segment, removes all the other segments mentioned there, and then recreates the control segment itself. If the postmaster is killed off (e.g. with kill -9) and restarted, it locates the old control segment and proceeds similarly. If the whole operating system is rebooted, the old control segment won't exist any more, but that's OK, because none of the other segments will either - except under the mmap-a-regular-file implementation, which handles cleanup by scanning the relevant directory rather than relying on the control segment. These precautions seem sufficient to ensure that dynamic shared memory segments can't outlive the postmaster short of a hard kill, and that even after a hard kill we'll clean things up on the next postmaster startup.

The other problem - making sure that segments get unmapped at the proper time - is solved using the resource owner mechanism. There is an API to create a mapping which is session-lifespan rather than resource-owner-lifespan, but the default is resource-owner lifespan, which I suspect will be right for common uses. Thus, there are four separate occasions on which we remove shared memory segments: (1) resource owner cleanup; (2) backend exit, for any session-lifespan mappings and anything else that slips through the cracks; (3) postmaster exit, in case a child dies without cleaning up after itself; and (4) postmaster startup, in case the postmaster dies without cleaning up.

There are quite a few problems that this patch does not solve. First, while it does give you a shared memory segment, it doesn't provide any help at all in figuring out what to put in that segment. The task of figuring out how to communicate usefully through shared memory is thus, for the moment, left entirely to the application programmer. While there may be cases where that's just right, I suspect there will be a wider range of cases where it isn't, and I plan to work next on some additional facilities sitting on top of this basic structure, though probably as a separate patch.

Second, it doesn't make any policy decisions about what is sensible, either in terms of the number of shared memory segments or the sizes of those segments, even though there are serious practical limits in both cases. The total number of segments system-wide is limited by the size of the control segment, which is sized based on MaxBackends, but there's nothing to keep a single backend from eating up all the slots, even though that's pretty unfriendly, and there's no real limit to the amount of memory it can gobble up per slot, either. In other words, it would be a bad idea to write a contrib module that exposes a relatively uncooked version of this layer to the user.

But, just for testing purposes, I did exactly that. The attached patch includes contrib/dsm_demo, which lets you say dsm_demo_create('something') in one session; if you pass the return value to dsm_demo_read() in the same or another session, during the lifetime of the first session, you'll read back the same value you saved. This is not, by any stretch of the imagination, a demonstration of the right way to use this facility - but as a crude unit test, it suffices. Although I'm including it in the patch file, I would anticipate removing it before commit.

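To make the shape of a consumer concrete, here is a hedged sketch of roughly what the two demo functions have to do, reusing the made-up declarations from the sketch above - so, again, this shows the pattern rather than the contrib module's actual code.

    #include <string.h>

    /* Stash a string in a new segment and return its handle. */
    static dsm_handle
    demo_create(const char *message)
    {
        size_t       size = strlen(message) + 1;
        dsm_segment *seg = dsm_create(size);

        /*
         * By default the mapping would be torn down with the current
         * resource owner (at end of query, or on abort); keep it for the
         * whole session instead, so that a later call can still attach.
         */
        dsm_keep_mapping(seg);

        memcpy(dsm_segment_address(seg), message, size);
        return dsm_segment_handle(seg);
    }

    /* Read the string back, from this session or another one. */
    static void
    demo_read(dsm_handle handle, char *buf, size_t buflen)
    {
        dsm_segment *seg = dsm_attach(handle);

        strncpy(buf, dsm_segment_address(seg), buflen - 1);
        buf[buflen - 1] = '\0';

        /* We only needed the mapping for the duration of this call. */
        dsm_detach(seg);
    }
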
Hopefully, with a little more functionality on top of what's included here, we'll soon be in a position to build something that might actually be useful to someone, but this layer by itself is a bit too impoverished to build something really cool - at least, not without more work than I wanted to put in as part of developing this patch.

Using that crappy contrib module, I verified that the POSIX, System V, and mmap implementations all work on my MacBook Pro (OS X 10.8.4) and on Linux (Fedora 16). I wouldn't like to wager that I've gotten every detail right on every platform, so I wouldn't be surprised to see this break on other systems. Hopefully that will be a matter of adjusting the configure tests a bit rather than coping with substantive differences in available functionality, but we'll see.

Finally, I'd like to thank Noah Misch for a lot of discussion and thought that enabled me to make this patch much better than it otherwise would have been. Although I didn't adopt Noah's preferred solutions to all of the problems, and although there are probably still some problems buried here, there would have been more if not for his advice. I'd also like to thank the entire database server team at EnterpriseDB for allowing me to dump large piles of work on them so that I could work on this, and my boss, Tom Kincaid, for not allowing other people to dump large piles of work on me.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company