Re: Dynamic shared memory areas - Mailing list pgsql-hackers
From: Thomas Munro
Subject: Re: Dynamic shared memory areas
Date:
Msg-id: CAEepm=3U7+Ro7=ECeQuAZoeFXs8iDVX56NXGCV7z3=+H+Wd0Sw@mail.gmail.com
In response to: Re: Dynamic shared memory areas (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: Dynamic shared memory areas; Re: Dynamic shared memory areas
List: pgsql-hackers
On Wed, Nov 30, 2016 at 4:35 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> More review:

Thanks!

> + * For large objects, we just stick all of the allocations in fullness class
> + * 0.  Since we can just return the space directly to the free page manager,
> + * we don't really need them on a list at all, except that if someone wants
> + * to bulk release everything allocated using this BlockAreaContext, we
> + * have no other way of finding them.
>
> This comment is out-of-date.

Removed.

> + /*
> +  * If this is the only span, and there is no active span, then maybe
> +  * we should probably move this span to fullness class 1.  (Otherwise
> +  * if you allocate exactly all the objects in the only span, it moves
> +  * to class 3, then you free them all, it moves to 2, and then is
> +  * given back, leaving no active span).
> +  */
>
> "maybe we should probably" seems to have one more doubt-expressing
> word than it needs.

Fixed.

> + if (size_class == DSA_SCLASS_SPAN_LARGE)
> +     /* Large object frees give back segments aggressively already. */
> +     continue;
>
> We generally use braces in this kind of case.

Fixed.

> + * Search the fullness class 1 only.  That is where we expect to find
>
> extra "the"

Fixed.

> + /* Call for effect (we don't need the result). */
> + get_segment_by_index(area, index);
> ...
> + return area->segment_maps[index].mapped_address + offset;
>
> It isn't guaranteed that area->segment_maps[index].mapped_address will
> be non-NULL on return from get_segment_by_index, and then this
> function will return a completely bogus pointer to the caller.  I
> think you should probably elog() instead.

Hmm.  Right.  In fact it's never OK to ask for a segment by index when
that segment is gone, since that implies an access-after-free, so
there is no reason for NULL to be handled by callers.  I have changed
get_segment_by_index to raise an error.

> + elog(ERROR, "dsa: can't attach to area handle %u", handle);
>
> Avoid ":" in elog messages.
> You don't really need to - and it isn't
> project style to - tag these with "dsa:"; that's what \errverbose or
> \set VERBOSITY verbose is for.  In this particular case, I might just
> adopt the formulation from parallel.c:
>
>     ereport(ERROR,
>             (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>              errmsg("could not map dynamic shared memory segment")));

Fixed.

> + elog(FATAL,
> +      "dsa couldn't find run of pages: fpm_largest out of sync");
>
> Here I'd go with "dsa could not find %u free pages".

Fixed.

> + elog(ERROR, "dsa_pin: area already pinned");
>
> "dsa_area already pinned"

Fixed.

> + elog(ERROR, "dsa_unpin: area not pinned");
>
> "dsa_area not pinned"

Fixed.

> + if (segment == NULL)
> +     elog(ERROR, "dsa: can't attach to segment");
>
> As above.

Fixed.

> +static dsa_segment_map *
> +get_segment_by_index(dsa_area *area, dsa_segment_index index)
> +{
> +     if (unlikely(area->segment_maps[index].mapped_address == NULL))
> +     {
> +             dsm_handle handle;
> +             dsm_segment *segment;
> +             dsa_segment_map *segment_map;
> +
> +             handle = area->control->segment_handles[index];
>
> Don't you need to acquire the lock for this?

No.  I've updated the comments to explain, and refactored a bit.  I'll
explain here in different words: this is memory, you are a C
programmer, and as with malloc/free, referencing memory that has been
freed invokes undefined behaviour, possibly including but not limited
to demons flying out of your nose.  When you call
dsa_get_address(some_dsa_pointer) or dsa_free(some_dsa_pointer) you
are asserting that the address points to memory allocated with
dsa_allocate from this area that has not yet been freed.  Given that
assertion, area->control->segment_handles[index] (where index is
extracted from the address) must be valid and cannot change under your
feet.
control->segment_handles[index] can only change after everything
allocated from that whole segment has been freed; you can think of it
as 'locked' as long as any live object exists in the segment it
corresponds to.

In general I'm trying not to do anything too clever in the first
version of DSA: it uses a plain old LWLock for each size class's pool
and then an area-wide LWLock for segment operations.  But in the
particular case of dsa_get_address, I think it's really important for
the viability of DSA for these address translations to be fast in the
likely path, hence my desire to figure out a protocol for lock-free
address translation even though segments come and go.

> + /* Check all currently mapped segments to find what's been freed. */
> + for (i = 0; i <= area->high_segment_index; ++i)
> + {
> +     if (area->segment_maps[i].header != NULL &&
> +         area->segment_maps[i].header->freed)
> +     {
> +         dsm_detach(area->segment_maps[i].segment);
> +         area->segment_maps[i].segment = NULL;
> +         area->segment_maps[i].header = NULL;
> +         area->segment_maps[i].mapped_address = NULL;
> +     }
> + }
> + area->freed_segment_counter = freed_segment_counter;
>
> And this?

Hmm.  I had a theory for why that didn't need to be locked, though it
admittedly lacked a necessary barrier -- d'oh.  But I'll spare you the
details and just lock it, because this is not a hot path and it's much
simpler that way.  I've also refactored that code into a new static
function check_for_freed_segments, because I realised that dsa_free
needs the same treatment as dsa_get_address.  The checking for freed
segments was also happening at the wrong time, which I've now
straightened out -- that must happen before you arrive in
get_segment_by_index, as described in the copious new comments.
Thoughts?

> +/*
> + * Release a DSA area that was produced by dsa_create_in_place or
> + * dsa_attach_in_place.  It is preferable to use one of the 'dsa_on_XXX'
> + * callbacks so that this is managed automatically, because failure to release
> + * an area created in-place leaks its segments permanently.
> + */
> +void
> +dsa_release_in_place(void *place)
> +{
> +     decrement_reference_count((dsa_area_control *) place);
> +}
>
> Since this seems to be the only caller of decrement_reference_count,
> you could just put the logic here.

Ok, done.

> The contract for this function is
> also a bit unclear from the header comment.  I initially thought that
> it was your intention that this should be called from every process
> that has either created or attached the segment.

That is indeed my intention.

> But that doesn't
> seem like it will work, because decrement_reference_count calls
> dsm_unpin_segment on every segment, and a segment can only be pinned
> once, so everybody else would fail.  So maybe the idea is that ANY ONE
> process has to call dsa_release_in_place.  But then that could lead to
> failures in other backends inside get_segment_by_index(), because
> segments they don't have mapped might already be gone.  OK, third try:
> maybe the idea is that the LAST process out has to call
> dsa_release_in_place().  But how do the various cooperating processes
> know which one that is?

It decrements the reference count for the area, but only unpins the
segments if the reference count reaches zero:

    Assert(control->refcnt > 0);
    if (--control->refcnt == 0)
    {
        /* ... unpin all the segments ... */
    }

> I've also realized another thing that's not so good about this:
> superblocks are 64kB, so allocating 64kB of initial space probably
> just wastes most of it.  I think we want to either allocate just
> enough space to hold the control information, or else that much space
> plus space for at least a few superblocks.  I'm inclined to go the
> first way, because it seems a bit overenthusiastic to allocate 256kB
> or 512kB just on the off chance we might need it.  On the other hand,
> including a few bytes in the control segment so that we don't need to
> allocate 1MB segment that we might not need sounds pretty sharp.
> Maybe DSA can expose an API that returns the number of bytes that will
> be needed for the control structure, and then the caller can arrange
> for that space to be available during the Estimate phase...

Yeah, I also thought about that, but didn't try to do better before
because I couldn't see how to make a nice macro for this without
dragging a ton of internal stuff out into the header.  I have written
a new function dsa_minimum_size().  The caller can use that number
directly to get a minimal in-place area that will immediately create
an extra DSM segment as soon as you call dsa_allocate.  Unfortunately
you can't really add more to that number with predictable results
unless you know some internal details and your future allocation
pattern: to avoid extra segment creation, you'd need to add 4KB for a
block of spans and then 64KB for each size class you plan to allocate,
and of course that might change.  But at least it allows us to create
an in-place DSA area for every parallel query cheaply, and then defer
creation of the first DSM segment until the first time someone tries
to allocate, which seems about right to me.

And in response to your earlier email:

On Tue, Nov 29, 2016 at 7:48 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> shm_mq_attach() made the opposite decision about how to solve this
> problem, and frankly I think that API is a lot more convenient: if the
> first argument to shm_mq_attach() happens to be located inside of a
> DSM, you can pass the DSM as the second argument and it registers the
> on_dsm_detach() hook for you.  If not, you can pass NULL and deal with
> it in some other way.  But this makes the common case very simple.

Ok, I've now done the same.
I feel like some more general destructor callback for objects in a
containing object is wanted here, rather than sticking dsm_segment *
into various constructor-like functions, but I haven't thought
seriously about that and I'm not arguing that case now.

Please find attached dsa-v8.patch, and also a small test module for
running random allocate/free exercises and dumping the internal
allocator state.

--
Thomas Munro
http://www.enterprisedb.com