Relation extension scalability - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Relation extension scalability |
Date | |
Msg-id | 20150329185619.GA29062@alap3.anarazel.de |
List | pgsql-hackers |
Hello,

Currently bigger shared_buffers settings don't combine well with relations being extended frequently, especially if many/most pages have a high usagecount and/or are dirty and the system is IO constrained.

As a quick recap, relation extension basically works like:
1) We lock the relation for extension
2) ReadBuffer*(P_NEW) is called to extend the relation
3) smgrnblocks() is used to find the new target block
4) We search for a victim buffer (via BufferAlloc()) to put the new block into
5) If dirty, the victim buffer is cleaned
6) The relation is extended using smgrextend()
7) The page is initialized

The problems come from 4) and 5) potentially each taking a fair while. If the working set mostly fits into shared_buffers, 4) can require iterating over all shared buffers several times to find a victim buffer. If the IO subsystem is busy and/or we've hit the kernel's dirty limits, 5) can take a couple of seconds.

I've prototyped solving this for heap relations by moving the smgrnblocks() + smgrextend() calls to RelationGetBufferForTuple(). With some care (including a retry loop) it's possible to do only those two under the extension lock. That indeed fixes problems in some of my tests.

I'm not sure whether the above is the best solution, however. For one, I think it's not necessarily a good idea to open-code this in hio.c - I've not observed it, but this probably can happen for btrees and such as well. For another, this is still an exclusive lock held while we're doing IO: smgrextend() wants a page to write out, so we have to be careful not to overwrite things.

There are two things that seem to make sense to me:

First, decouple relation extension from ReadBuffer*, i.e. remove P_NEW and introduce a bufmgr function specifically for extension.

Secondly, I think we could maybe remove the requirement of needing an extension lock altogether. It's primarily required because we're worried that somebody else can come along, read the page, and initialize it before us.
ISTM that could be resolved by *not* writing any data via smgrextend()/mdextend(). If we instead only do the write once we've read in & locked the page exclusively, there's no need for the extension lock. We probably still should write out the new page to the OS immediately once we've initialized it, to avoid creating sparse files.

The other reason we need the extension lock is that code like lazy_scan_heap() and btvacuumscan() tries to avoid initializing pages that are about to be initialized by the extending backend. I think we should just remove that code and deal with the problem by retrying in the extending backend; that's why I think moving extension to a different file might be helpful.

I've attached my POC for heap extension, but it's really just a POC at this point.

Greetings,

Andres Freund

-- 
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
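
As a rough illustration of the recap and of the prototype's ordering described in the message above, here is a toy C model. It is not PostgreSQL code and not the attached POC: every name in it (ToyRelation, ToyBuffer, toy_extend_current(), toy_extend_prototype() and the helpers) is hypothetical, the extension lock is modelled as a plain flag, and the retry loop and page initialization (step 7) are deliberately left out. It only counts how often the slow steps 4) and 5) run while the lock is held.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a relation; not a PostgreSQL type. */
typedef struct ToyRelation {
    bool     extension_lock_held;   /* models the relation extension lock */
    uint32_t nblocks;               /* models what smgrnblocks() would report */
    unsigned slow_steps_under_lock; /* victim searches/cleanings done while locked */
} ToyRelation;

/* Hypothetical stand-in for a shared buffer. */
typedef struct ToyBuffer {
    bool     dirty;
    uint32_t block;
} ToyBuffer;

static void toy_lock(ToyRelation *rel)
{
    assert(!rel->extension_lock_held);
    rel->extension_lock_held = true;
}

static void toy_unlock(ToyRelation *rel)
{
    assert(rel->extension_lock_held);
    rel->extension_lock_held = false;
}

/* Steps 4) and 5): find a victim buffer and clean it if dirty (the slow part). */
static void toy_prepare_victim(ToyRelation *rel, ToyBuffer *buf)
{
    if (rel->extension_lock_held)
        rel->slow_steps_under_lock++;   /* record slow work done under the lock */
    buf->dirty = false;
}

/* Step 3): find the new target block; only valid under the lock. */
static uint32_t toy_nblocks(const ToyRelation *rel)
{
    assert(rel->extension_lock_held);
    return rel->nblocks;
}

/* Step 6): extend the relation by one block; only valid under the lock. */
static void toy_extend(ToyRelation *rel, ToyBuffer *buf, uint32_t new_block)
{
    assert(rel->extension_lock_held);
    rel->nblocks = new_block + 1;
    buf->block = new_block;
}

/* Ordering from the recap: steps 3) to 6) all run under the extension lock. */
static uint32_t toy_extend_current(ToyRelation *rel, ToyBuffer *buf)
{
    toy_lock(rel);
    uint32_t blk = toy_nblocks(rel);
    toy_prepare_victim(rel, buf);
    toy_extend(rel, buf, blk);
    toy_unlock(rel);
    return blk;
}

/* Prototype ordering: the victim is prepared first, then only steps 3) and 6) are locked. */
static uint32_t toy_extend_prototype(ToyRelation *rel, ToyBuffer *buf)
{
    toy_prepare_victim(rel, buf);
    toy_lock(rel);
    uint32_t blk = toy_nblocks(rel);
    toy_extend(rel, buf, blk);
    toy_unlock(rel);
    return blk;
}
```

Calling toy_extend_current() and then toy_extend_prototype() on the same ToyRelation increases slow_steps_under_lock only for the first call, which is the effect the prototype is after. The model says nothing about the second proposal (dropping the extension lock entirely).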