Re: GIN pending list pages not recycled promptly (was Re: GIN improvements part 1: additional information) - Mailing list pgsql-hackers
From | Amit Langote |
---|---|
Subject | Re: GIN pending list pages not recycled promptly (was Re: GIN improvements part 1: additional information) |
Date | |
Msg-id | CA+HiwqGO9RM5ak2kVMTjbYKNthf5oEE7TM3cM_zY1uVWmG8iYg@mail.gmail.com |
In response to | GIN pending list pages not recycled promptly (was Re: GIN improvements part 1: additional information) (Heikki Linnakangas <hlinnakangas@vmware.com>) |
Responses | Re: GIN pending list pages not recycled promptly (was Re: GIN improvements part 1: additional information) |
List | pgsql-hackers |
On Wed, Jan 22, 2014 at 9:12 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 01/22/2014 03:39 AM, Tomas Vondra wrote:
>>
>> What annoys me a bit is the huge size difference between the index
>> updated incrementally (by a sequence of INSERT commands), and the index
>> rebuilt from scratch using VACUUM FULL. It's a bit better with the patch
>> (2288 vs. 2035 MB), but is there a chance to improve this?
>
>
> Hmm. What seems to be happening is that pending item list pages that the
> fast update mechanism uses are not getting recycled. When enough list pages
> are filled up, they are flushed into the main index and the list pages are
> marked as deleted. But they are not recorded in the FSM, so they won't be
> recycled until the index is vacuumed. Almost all of the difference can be
> attributed to deleted pages left behind like that.
>
> So this isn't actually related to the packed postinglists patch at all. It
> just makes the bloat more obvious, because it makes the actual size of the
> index size, excluding deleted pages, smaller. But it can be observed on git
> master as well:
>
> I created a simple test table and index like this:
>
> create table foo (intarr int[]);
> create index i_foo on foo using gin(intarr) with (fastupdate=on);
>
> I filled the table like this:
>
> insert into foo select array[-1] from generate_series(1, 10000000) g;
>
> postgres=# \d+i
>                    List of relations
>  Schema | Name | Type  | Owner  |  Size  | Description
> --------+------+-------+--------+--------+-------------
>  public | foo  | table | heikki | 575 MB |
> (1 row)
>
> postgres=# \di+
>                        List of relations
>  Schema | Name  | Type  | Owner  | Table |  Size  | Description
> --------+-------+-------+--------+-------+--------+-------------
>  public | i_foo | index | heikki | foo   | 251 MB |
> (1 row)
>
> I wrote a little utility that scans all pages in a gin index, and prints out
> the flags indicating what kind of a page it is. The distribution looks like
> this:
>
>      19 DATA
>    7420 DATA LEAF
>   24701 DELETED
>       1 LEAF
>       1 META
>
> I think we need to add the deleted pages to the FSM more aggressively.
>
> I tried simply adding calls to RecordFreeIndexPage, after the list pages
> have been marked as deleted, but unfortunately that didn't help. The problem
> is that the FSM is organized into a three-level tree, and
> RecordFreeIndexPage only updates the bottom level. The upper levels are not
> updated until the FSM is vacuumed, so the pages are still not visible to
> GetFreeIndexPage calls until next vacuum. The simplest fix would be to add a
> call to IndexFreeSpaceMapVacuum after flushing the pending list, per
> attached patch. I'm slightly worried about the performance impact of the
> IndexFreeSpaceMapVacuum() call. It scans the whole FSM of the index, which
> isn't exactly free. So perhaps we should teach RecordFreeIndexPage to update
> the upper levels of the FSM in a retail-fashion instead.
>

I wonder if you pursued this further? You recently added a number of TODO
items related to GIN index; is it worth adding this to the list?

--
Amit
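
For readers following along, the FSM behaviour quoted above can be sketched with a toy model. This is only an illustration under stated assumptions: the class ToyFSM, its method names, and the fanout of 4 are all hypothetical and are not PostgreSQL code. Each method's comment names the real function it is loosely meant to mirror, based solely on the description in the quoted mail. It shows why a page recorded only at the bottom level stays invisible to a top-down search until the upper levels are refreshed, and how a retail update would avoid the full scan.

```python
# Toy model only: a three-level "max tree" standing in for the FSM.
# ToyFSM and all method names are hypothetical, not PostgreSQL functions.

FANOUT = 4  # slots summarized by one upper-level slot (assumed, tiny on purpose)


class ToyFSM:
    def __init__(self, npages):
        # levels[0] has one slot per index page; each higher level stores
        # the maximum of FANOUT slots from the level below it.
        self.levels = []
        n = npages
        for _ in range(3):
            self.levels.append([0] * n)
            n = max(1, -(-n // FANOUT))  # ceiling division

    def record_free_page(self, blkno):
        # Loosely mirrors RecordFreeIndexPage as described in the mail:
        # only the bottom level is touched.
        self.levels[0][blkno] = 1

    def record_free_page_retail(self, blkno):
        # The "retail" alternative: also update the ancestors of this slot.
        self.levels[0][blkno] = 1
        slot = blkno
        for lvl in (1, 2):
            slot //= FANOUT
            self.levels[lvl][slot] = 1

    def vacuum(self):
        # Loosely mirrors IndexFreeSpaceMapVacuum: recompute every upper
        # slot from its children, which means scanning the whole map.
        for lvl in (1, 2):
            below = self.levels[lvl - 1]
            for i in range(len(self.levels[lvl])):
                children = below[i * FANOUT:(i + 1) * FANOUT]
                self.levels[lvl][i] = max(children, default=0)

    def find_free_page(self):
        # Loosely mirrors GetFreeIndexPage: search from the top level down.
        # Returns a block number, or None if the top level shows nothing.
        slot = None
        for lvl in (2, 1, 0):
            nodes = self.levels[lvl]
            if slot is None:
                candidates = range(len(nodes))
            else:
                end = min((slot + 1) * FANOUT, len(nodes))
                candidates = range(slot * FANOUT, end)
            slot = next((i for i in candidates if nodes[i] > 0), None)
            if slot is None:
                return None
        return slot


if __name__ == "__main__":
    fsm = ToyFSM(32)
    fsm.record_free_page(13)
    print(fsm.find_free_page())  # bottom level only: search finds nothing
    fsm.vacuum()
    print(fsm.find_free_page())  # after the full scan the page is visible
```

In this sketch, record_free_page_retail touches one slot per level, whereas vacuum walks every slot, which is the cost trade-off the quoted mail is worried about.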