Re: Vacuum: allow usage of more than 1GB of work mem - Mailing list pgsql-hackers
| From | Masahiko Sawada |
|---|---|
| Subject | Re: Vacuum: allow usage of more than 1GB of work mem |
| Date | |
| Msg-id | CAD21AoAejVzY-0WivN5HfxKiyXcoaWushvAvgpS86N43u2CitA@mail.gmail.com |
| In response to | Re: Vacuum: allow usage of more than 1GB of work mem (Pavan Deolasee <pavan.deolasee@gmail.com>) |
| Responses | Re: Vacuum: allow usage of more than 1GB of work mem |
| List | pgsql-hackers |
On Thu, Sep 8, 2016 at 11:54 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
>
> On Wed, Sep 7, 2016 at 10:18 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>>
>> On Wed, Sep 7, 2016 at 12:12 PM, Greg Stark <stark@mit.edu> wrote:
>> > On Wed, Sep 7, 2016 at 1:45 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> >> On 6 September 2016 at 19:59, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> >>
>> >>> The idea of looking to the stats to *guess* about how many tuples are
>> >>> removable doesn't seem bad at all. But imagining that that's going to be
>> >>> exact is folly of the first magnitude.
>> >>
>> >> Yes. Bear in mind I had already referred to allowing +10% to be safe,
>> >> so I think we agree that a reasonably accurate, yet imprecise,
>> >> calculation is possible in most cases.
>> >
>> > That would all be well and good if it weren't trivial to do what
>> > Robert suggested. This is just a large unsorted list that we need to
>> > iterate through. Just allocate chunks of a few megabytes, and when
>> > one is full allocate a new chunk and keep going. There's no need to get
>> > tricky with estimates and resizing and whatever.
>>
>> I agree. While the idea of estimating the right size sounds promising
>> a priori, considering that the estimate can go wrong and over- or
>> under-allocate quite severely, the risks outweigh the benefits when you
>> consider the alternative of a dynamic allocation strategy.
>>
>> Unless the dynamic strategy has a bigger CPU impact than expected, I
>> believe it's a superior approach.
>>
> How about a completely different representation for the TID array? Now this
> is probably not something new, but I couldn't find if the exact same idea
> was discussed before. I also think it's somewhat orthogonal to what we are
> trying to do here, and will probably be a bigger change. But I thought I'd
> mention it since we are on the topic.
>
> What I propose is to use a simple bitmap to represent the tuples. If a tuple
> at <block, offset> is dead, then the corresponding bit in the bitmap is set.
> So clearly the searches through dead tuples are O(1) operations, which is
> important for very large tables and large arrays.
>
> The challenge really is that a heap page can theoretically have MaxOffsetNumber
> line pointers (or, to be precise, the maximum possible offset number). For an 8K
> block, that comes to about 2048. Having so many bits per page is neither
> practical nor optimal. But in practice the largest offset on a heap page
> should not be significantly greater than MaxHeapTuplesPerPage, which is a
> more reasonable value of 291 on my machine. Again, that's with zero-sized
> tuples; for real-life large tables with much wider tuples, the number may
> go down even further.
>
> So we cap the offsets represented in the bitmap at some realistic value,
> computed by looking at page density and then multiplying it by a small
> factor (not more than two) to take into account LP_DEAD and LP_REDIRECT line
> pointers. That should practically cover the majority of the dead tuples in
> the table, and we then keep an overflow area to record tuples beyond the
> per-page limit. The search routine will do a direct lookup for
> offsets less than the limit and search in the sorted overflow area for
> offsets beyond the limit.
>
> For example, for a table with 60-byte-wide tuples (including the 24-byte
> header), each page can hold approximately 8192/60 = 136 tuples. Say we
> provision for 136*2 = 272 bits per page, i.e. 34 bytes per page for the
> bitmap.
> The first 272 offsets in every page are represented in the bitmap, and
> anything greater goes in the overflow region. On the other hand, the current
> representation will need about 16 bytes per page assuming 2% dead tuples, 40
> bytes per page assuming 5% dead tuples, and 80 bytes assuming 10% dead
> tuples. So the bitmap will take more space for small tuples or when vacuum is
> run very aggressively, both of which seem unlikely for very large tables. Of
> course the calculation does not take into account the space needed by the
> overflow area, but I expect that to be small.
>
> I guess we can make a choice between the two representations at the start by
> looking at various table stats. We can also be smart and change from the
> bitmap to the traditional representation as we scan the table and see many
> more tuples in the overflow region than we provisioned for. There will be
> some challenges in converting the representation mid-way, especially in terms
> of memory allocation, but I think those can be sorted out if we think that
> the idea has merit.
>

Making vacuum able to choose between two data representations sounds good. I
implemented a patch that changes the dead tuple representation to a bitmap
before. I will measure the performance of the bitmap representation again and
post the results.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
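
A minimal standalone sketch of the chunked-allocation approach Greg Stark describes above: dead TIDs are appended to a linked list of fixed-size chunks, so memory grows with the number of dead tuples actually found instead of being sized up front from maintenance_work_mem. The type names (DeadTid, TidChunk, TidList) and the 4 MB chunk size are illustrative assumptions only, not PostgreSQL code; an in-core version would presumably build on ItemPointerData and palloc in the vacuum memory context.

```c
/*
 * Sketch of a dead-TID list that grows in fixed-size chunks.
 * Simplified standalone code; not actual PostgreSQL structures.
 */
#include <stdio.h>
#include <stdlib.h>

typedef struct DeadTid              /* simplified stand-in for ItemPointerData */
{
    unsigned int   blkno;
    unsigned short offnum;
} DeadTid;

#define CHUNK_TIDS  (4 * 1024 * 1024 / sizeof(DeadTid))  /* ~4 MB per chunk */

typedef struct TidChunk
{
    struct TidChunk *next;
    size_t           ntids;
    DeadTid          tids[CHUNK_TIDS];
} TidChunk;

typedef struct TidList
{
    TidChunk *head;
    TidChunk *tail;
    size_t    total;
} TidList;

static void
tidlist_add(TidList *list, unsigned int blkno, unsigned short offnum)
{
    if (list->tail == NULL || list->tail->ntids == CHUNK_TIDS)
    {
        /* list empty or current chunk full: allocate another chunk */
        TidChunk *chunk = malloc(sizeof(TidChunk));     /* error check omitted */

        chunk->next = NULL;
        chunk->ntids = 0;
        if (list->tail)
            list->tail->next = chunk;
        else
            list->head = chunk;
        list->tail = chunk;
    }
    list->tail->tids[list->tail->ntids].blkno = blkno;
    list->tail->tids[list->tail->ntids].offnum = offnum;
    list->tail->ntids++;
    list->total++;
}

int
main(void)
{
    TidList list = {NULL, NULL, 0};

    /* pretend these dead tuples were found while scanning the heap */
    tidlist_add(&list, 0, 3);
    tidlist_add(&list, 0, 7);
    tidlist_add(&list, 5, 1);

    printf("collected %zu dead TIDs\n", list.total);
    return 0;
}
```

Because vacuum scans the heap in block order, the chunks as a whole stay sorted, so the binary-search lookup done during index cleanup (lazy_tid_reaped()) can still work with one extra level of indirection: locate the chunk, then binary-search within it.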
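
And a similarly hedged sketch of the per-page bitmap plus overflow area Pavan proposes, using the 272-bit cap from his worked example. Everything here (DeadTupleBitmap, dtb_set, dtb_lookup, the fixed 1024-entry overflow array) is invented for illustration; as the mail suggests, a real implementation would derive the per-page cap from table stats at the start of the scan and could switch representations if the overflow area grows too large.

```c
/*
 * Sketch of the per-page bitmap representation: each heap page gets a fixed
 * number of bits (capped well below MaxOffsetNumber), and dead tuples whose
 * offset exceeds the cap spill into a small sorted overflow array that is
 * binary-searched.  Simplified standalone code; not PostgreSQL structures.
 */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define BITS_PER_PAGE   272                        /* ~2x expected max offset */
#define BYTES_PER_PAGE  ((BITS_PER_PAGE + 7) / 8)  /* 34 bytes per heap page */
#define MAX_OVERFLOW    1024                       /* sketch: fixed spill area */

typedef struct OverflowTid
{
    unsigned int   blkno;
    unsigned short offnum;
} OverflowTid;

typedef struct DeadTupleBitmap
{
    unsigned char *bits;                    /* nblocks * BYTES_PER_PAGE bytes */
    OverflowTid    overflow[MAX_OVERFLOW];  /* kept sorted in TID order */
    size_t         noverflow;
} DeadTupleBitmap;

/* record (blkno, offnum) as dead; vacuum produces TIDs in ascending order */
static void
dtb_set(DeadTupleBitmap *dtb, unsigned int blkno, unsigned short offnum)
{
    if (offnum <= BITS_PER_PAGE)
    {
        size_t bit = (size_t) blkno * BITS_PER_PAGE + (offnum - 1);

        dtb->bits[bit / 8] |= (unsigned char) (1 << (bit % 8));
    }
    else if (dtb->noverflow < MAX_OVERFLOW)
    {
        /* rare: offset beyond the cap, spill into the sorted overflow area */
        dtb->overflow[dtb->noverflow].blkno = blkno;
        dtb->overflow[dtb->noverflow].offnum = offnum;
        dtb->noverflow++;
    }
    /* else: a real implementation would grow the area or switch representation */
}

/* index cleanup asks: is this TID dead? */
static bool
dtb_lookup(const DeadTupleBitmap *dtb, unsigned int blkno, unsigned short offnum)
{
    if (offnum <= BITS_PER_PAGE)
    {
        size_t bit = (size_t) blkno * BITS_PER_PAGE + (offnum - 1);

        return (dtb->bits[bit / 8] & (1 << (bit % 8))) != 0;   /* O(1) */
    }

    /* binary search the sorted overflow area */
    size_t lo = 0, hi = dtb->noverflow;

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;
        const OverflowTid *t = &dtb->overflow[mid];

        if (t->blkno == blkno && t->offnum == offnum)
            return true;
        if (t->blkno < blkno || (t->blkno == blkno && t->offnum < offnum))
            lo = mid + 1;
        else
            hi = mid;
    }
    return false;
}

int
main(void)
{
    unsigned int nblocks = 1000;               /* pretend 1000-page table */
    DeadTupleBitmap dtb = {0};

    dtb.bits = calloc(nblocks, BYTES_PER_PAGE);

    dtb_set(&dtb, 42, 7);
    dtb_set(&dtb, 42, 300);                    /* beyond the cap: overflow */

    printf("(42,7)   dead? %d\n", dtb_lookup(&dtb, 42, 7));
    printf("(42,300) dead? %d\n", dtb_lookup(&dtb, 42, 300));
    printf("(42,8)   dead? %d\n", dtb_lookup(&dtb, 42, 8));

    free(dtb.bits);
    return 0;
}
```

At the 272-bit cap this costs a flat 34 bytes per page regardless of how many tuples are dead, which is why the mail notes the bitmap only wins once the dead-tuple fraction is high enough: the traditional 6-byte TID array costs roughly 16, 40, and 80 bytes per page at 2%, 5%, and 10% dead tuples respectively.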