Re: Relation extension scalability - Mailing list pgsql-hackers
From:           Tom Lane
Subject:        Re: Relation extension scalability
Date:
Msg-id:         13296.1437319705@sss.pgh.pa.us
In response to: Re: Relation extension scalability  (Andres Freund <andres@anarazel.de>)
Responses:      Re: Relation extension scalability
List:           pgsql-hackers
Andres Freund <andres@anarazel.de> writes:
> So, to get to the actual meat: My goal was to essentially get rid of an
> exclusive lock over relation extension altogether. I think I found a
> way to do that that addresses the concerns raised in this thread.
>
> The new algorithm basically is:
>
> 1) Acquire victim buffer, clean it, and mark it as pinned
> 2) Get the current size of the relation, save it into blockno
> 3) Try to insert an entry into the buffer table for blockno
> 4) If the page is already in the buffer table, increment blockno by 1,
>    goto 3)
> 5) Try to read the page. In most cases it'll not yet exist. But the page
>    might concurrently have been written by another backend and removed
>    from shared buffers already. If already existing, goto 1)
> 6) Zero out the page on disk.
>
> I think this does handle the concurrency issues.

The need for (5) kind of destroys my faith in this really being safe:
it says there are non-obvious race conditions here.  For instance, what
about this scenario:

* Session 1 tries to extend the file, allocates a buffer for page 42
  (so it's now between steps 4 and 5).

* Session 2 tries to extend the file, sees the buffer for 42 exists,
  allocates a buffer for page 43 (so it's also now between 4 and 5).

* Session 2 tries to read page 43, sees it's not there, and writes out
  page 43 with zeroes (now it's done).

* Session 1 tries to read page 42, sees it's there and zero-filled (not
  because anybody wrote it, but because holes in files read as 0).

At this point session 1 will go and create page 44, won't it, and you
just wasted a page.  Now, we do have mechanisms for reclaiming such
pages, but they may not kick in until VACUUM, so you could end up with
a whole lot of table bloat.

Also, the file is likely to end up badly physically fragmented when the
skipped pages are finally filled in.  One of the good things about the
relation extension lock is that the kernel sees the file as being
extended strictly sequentially, which it should handle fairly well as
far as filesystem layout goes.  This way might end up creating a mess
on disk.

Perhaps even more to the point, you've added a read() kernel call that
was previously not there at all, without having removed either the
lseek() or the write().  Perhaps that scales better when what you're
measuring is saturation conditions on a many-core machine, but I have a
very hard time believing that it's not a significant net loss under
less-contended conditions.

I'm inclined to think that a better solution in the long run is to keep
the relation extension lock but find a way to extend files more than
one page per lock acquisition.

			regards, tom lane
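
The race scenario above hinges on the fact that holes in files read
back as zeroes.  A minimal standalone C sketch of that behavior (the
file name, page size, and block numbers are illustrative only, not
taken from the patch):

    /*
     * Demonstrates that a hole in a file reads back as zeroes: writing
     * at an offset past EOF leaves a hole, and a read of that hole
     * returns zero-filled bytes, indistinguishable from a page that
     * some other backend explicitly zeroed.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define PAGESZ 8192            /* PostgreSQL's default block size */

    int main(void)
    {
        char    page[PAGESZ];
        ssize_t n;
        int     fd = open("sparse_demo.tmp",
                          O_RDWR | O_CREAT | O_TRUNC, 0600);

        if (fd < 0)
        {
            perror("open");
            return 1;
        }

        /* "Session 2" writes page 43, skipping page 42 entirely. */
        memset(page, 0xAA, PAGESZ);
        if (pwrite(fd, page, PAGESZ, (off_t) 43 * PAGESZ) != PAGESZ)
        {
            perror("pwrite");
            return 1;
        }

        /*
         * "Session 1" now reads page 42: the read succeeds, and the
         * hole comes back as zeroes, exactly as if a zero-filled page
         * had been written there.
         */
        n = pread(fd, page, PAGESZ, (off_t) 42 * PAGESZ);
        printf("read %zd bytes, first byte = 0x%02x\n",
               n, (unsigned char) page[0]);
        /* prints: read 8192 bytes, first byte = 0x00 */

        close(fd);
        unlink("sparse_demo.tmp");
        return 0;
    }

The closing suggestion, extending by several pages per lock
acquisition, might look roughly like the following sketch.  Everything
here (get_new_block, EXTEND_BATCH, the pthread mutex standing in for
the relation extension lock) is a hypothetical illustration, not
PostgreSQL internals:

    /*
     * Illustrative sketch of "extend more than one page per lock
     * acquisition": the extension lock is kept, but each holder grows
     * the file by EXTEND_BATCH zero-filled pages at once, so the lock
     * and the write() are paid once per batch rather than once per
     * page.  All names are hypothetical; this is not PostgreSQL code.
     */
    #include <pthread.h>
    #include <unistd.h>

    #define PAGESZ       8192
    #define EXTEND_BATCH 16        /* pages added per lock acquisition */

    static pthread_mutex_t extension_lock = PTHREAD_MUTEX_INITIALIZER;
    static long next_free_block = 0;  /* first block not yet handed out */
    static long file_nblocks = 0;     /* current physical length, blocks */

    /*
     * Return a block number the caller may initialize.  The file is
     * extended only when the previous batch is exhausted, and always
     * sequentially, preserving the filesystem-layout benefit of the
     * existing lock.
     */
    long
    get_new_block(int fd)
    {
        static const char zeropage[PAGESZ]; /* statics are zero-filled */
        long    blkno;
        int     i;

        pthread_mutex_lock(&extension_lock);
        if (next_free_block >= file_nblocks)
        {
            for (i = 0; i < EXTEND_BATCH; i++)
                (void) pwrite(fd, zeropage, PAGESZ,
                              (off_t) (file_nblocks + i) * PAGESZ);
            file_nblocks += EXTEND_BATCH;
        }
        blkno = next_free_block++;
        pthread_mutex_unlock(&extension_lock);
        return blkno;
    }

Under contention, only one backend per EXTEND_BATCH allocations pays
the write() cost; the others hold the lock just long enough to claim a
block number.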