Re: Minmax indexes - Mailing list pgsql-hackers
From | Alvaro Herrera |
---|---|
Subject | Re: Minmax indexes |
Date | |
Msg-id | 20140623170751.GA5032@eldon.alvh.no-ip.org |
In response to | Re: Minmax indexes (Heikki Linnakangas <hlinnakangas@vmware.com>) |
Responses | Re: Minmax indexes |
List | pgsql-hackers |
Heikki Linnakangas wrote:

> Some comments, aside from the design wrt. bounding boxes etc.:

Thanks.  I haven't commented on that sub-thread because I think it's
possible to come up with a reasonable design that solves the issue by
adding a couple of amprocs.  I need to do some more thinking to ensure
it is really workable, and then I'll post my ideas.

> On 06/15/2014 05:34 AM, Alvaro Herrera wrote:
> > Robert Haas wrote:
>
> If I understand the code correctly, the revmap is a three-level deep
> structure.  The bottom level consists of "regular revmap pages", and
> each regular revmap page is filled with ItemPointerDatas, which point
> to the index tuples.  The middle level consists of "array revmap
> pages", and each array revmap page contains an array of BlockNumbers
> of the "regular revmap" pages.  The top level is an array of
> BlockNumbers of the array revmap pages, and it is stored in the
> metapage.

Yep, that's correct.  Essentially, we still have the revmap as a linear
space (containing TIDs); the other two levels on top of that are only
there to enable locating the physical page number for each revmap
logical page.  I make one exception: the first logical revmap page is
always stored in page 1, to optimize the case of a smallish table
(~1360 page ranges, i.e. approximately 1.3 gigabytes of data at 128
pages per range, or 170 megabytes at 16 pages per range).  Each page
has a page header (24 bytes) and special space (4 bytes), so it has
8192-28=8164 bytes available for data, which is room for 1360 item
pointers.

> With 8k block size, that's just enough to cover the full range of
> 2^32-1 blocks that you'll need if you set mm_pages_per_range=1.  Each
> regular revmap page can store about 8192/6 = 1365 item pointers, each
> array revmap page can store about 8192/4 = 2048 block references, and
> the size of the top array is 8192/4.  That's just enough; to store
> the required number of array pages in the top array, the array needs
> to be (2^32/1365)/2048 = 1536 elements large.
>
> But with 4k or smaller blocks, it's not enough.

Yeah.  As I said elsewhere, actual useful values are likely to be close
to the read-ahead setting of the underlying disk; by default that'd be
16 pages (128kB), but I think it's common advice to increase the kernel
setting to improve performance.  I don't think we need to prevent
minmax indexes with pages_per_range=1, but I don't think we need to
ensure that that setting works with the largest tables either, because
it doesn't make any sense to set it up like that.

Also, while there are some recommendations to set up a system with
larger page sizes (32kB), I have never seen any recommendation to set
it lower.  It wouldn't make sense to build a system that has very large
tables and use a smaller page size.

So in other words, yes, you're correct that the mechanism doesn't work
in some cases (small page size and an index configured for the highest
level of detail), but the conditions are such that I don't think it
matters.  ISTM the thing to do here is to do the math at index creation
time, and if we find that we don't have enough space in the metapage
for all the array revmap pointers we need, bail out and require the
index to be created with a larger pages_per_range setting.
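Just to make that arithmetic concrete, a creation-time check along those
lines could look roughly like the standalone sketch below.  None of the
names or constants are taken from the patch; it only redoes the numbers
quoted above for a given block size and pages_per_range:

/*
 * Standalone sketch (not patch code): redo the revmap sizing math for a
 * given block size and pages_per_range, and check whether the metapage's
 * top-level array could address the whole revmap for a 2^32-1 block table.
 */
#include <stdio.h>
#include <stdint.h>

#define ITEMPOINTER_SIZE  6            /* size of an ItemPointerData */
#define BLOCKNUMBER_SIZE  4            /* size of a BlockNumber */
#define PAGE_OVERHEAD     28           /* 24-byte header + 4-byte special space */
#define MAX_BLOCKS        4294967295u  /* 2^32 - 1 */

static int
revmap_fits(unsigned blcksz, unsigned pages_per_range)
{
    uint64_t usable      = blcksz - PAGE_OVERHEAD;
    uint64_t tids_per_pg = usable / ITEMPOINTER_SIZE;  /* regular revmap page */
    uint64_t blks_per_pg = usable / BLOCKNUMBER_SIZE;  /* array revmap page */
    uint64_t top_entries = usable / BLOCKNUMBER_SIZE;  /* metapage array, roughly */

    /* page ranges -> regular revmap pages -> array revmap pages */
    uint64_t ranges      = ((uint64_t) MAX_BLOCKS + pages_per_range - 1) / pages_per_range;
    uint64_t regular_pgs = (ranges + tids_per_pg - 1) / tids_per_pg;
    uint64_t array_pgs   = (regular_pgs + blks_per_pg - 1) / blks_per_pg;

    printf("blcksz=%u ppr=%u: %llu ranges, %llu regular pages, %llu array pages, "
           "metapage holds %llu\n",
           blcksz, pages_per_range,
           (unsigned long long) ranges, (unsigned long long) regular_pgs,
           (unsigned long long) array_pgs, (unsigned long long) top_entries);

    return array_pgs <= top_entries;
}

int
main(void)
{
    /* 8kB pages with pages_per_range=1: barely fits */
    if (!revmap_fits(8192, 1))
        printf("8kB: would bail out, require larger pages_per_range\n");

    /* 4kB pages with pages_per_range=1: does not fit, creation would bail out */
    if (!revmap_fits(4096, 1))
        printf("4kB: would bail out, require larger pages_per_range\n");

    return 0;
}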
> I wonder if it would be simpler to just always store the revmap pages
> in the beginning of the index, before any other pages.  Finding the
> revmap page would then be just as easy as with a separate fork.  When
> the table/index is extended so that a new revmap page is needed, move
> the existing page at that block out of the way.  Locking needs some
> consideration, but I think it would be feasible and simpler than what
> you have now.

Moving index items around is not easy, because you'd have to adjust the
revmap to rewrite the item pointers.

> > I have followed the suggestion by Amit to overwrite the index tuple
> > when a new heap tuple is inserted, instead of creating a separate
> > index tuple.  This saves a lot of index bloat.  This required a new
> > entry point in bufpage.c, PageOverwriteItemData().  bufpage.c also
> > has a new function PageIndexDeleteNoCompact which is similar in
> > spirit to PageIndexMultiDelete except that item pointers do not
> > change.  This is necessary because the revmap stores item pointers,
> > and such references would break if we were to renumber items in
> > index pages.
>
> ISTM that when the old tuple cannot be updated in-place, the new index
> tuple is inserted with mm_doinsert(), but the old tuple is never
> deleted.

It's deleted by the next vacuum.
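To illustrate the stable-item-pointer point with a toy model (this is
not the actual bufpage.c code; the struct and function names below are
invented for the example): overwriting or unsetting a slot in place
keeps any offsets stored elsewhere valid, whereas compaction would
renumber them.

/*
 * Toy model (not the bufpage.c API): a "page" as a fixed array of slots,
 * showing why overwriting or unsetting a slot in place keeps externally
 * stored offsets valid, while compacting would renumber them.
 */
#include <stdio.h>

#define NSLOTS 4

typedef struct
{
    int  used;
    int  payload;
} Slot;

/* Replace the contents of a slot without touching any other slot. */
static void
overwrite_in_place(Slot *page, int off, int newval)
{
    page[off].payload = newval;
    page[off].used = 1;
}

/* Mark a slot unused without shifting the slots that follow it. */
static void
delete_no_compact(Slot *page, int off)
{
    page[off].used = 0;
}

int
main(void)
{
    Slot page[NSLOTS] = {{1, 10}, {1, 20}, {1, 30}, {1, 40}};
    int  stored_ref = 2;              /* an offset remembered elsewhere, like the revmap's TIDs */

    overwrite_in_place(page, 1, 25);  /* summary tuple replaced in place */
    delete_no_compact(page, 0);       /* stale tuple removed, no renumbering */

    /* The remembered offset still designates the same logical item. */
    printf("stored ref -> payload %d (used=%d)\n",
           page[stored_ref].payload, page[stored_ref].used);
    return 0;
}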
-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services