Re: On-disk Tuple Size - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: On-disk Tuple Size |
Date | |
Msg-id | 23713.1019318231@sss.pgh.pa.us |
In response to | On-disk Tuple Size (Curt Sampson <cjs@cynic.net>) |
Responses |
Re: On-disk Tuple Size
Re: On-disk Tuple Size |
List | pgsql-hackers |
Curt Sampson <cjs@cynic.net> writes:
> While we're at it, would someone have the time to explain to me
> how the on-disk CommandIds are used?

To determine visibility of tuples for commands within a transaction. Just as you don't want your transaction's effects to become visible until you commit, you don't want an individual command's effects to become visible until you do CommandCounterIncrement. Among other things this solves the Halloween problem for us (how do you stop an UPDATE from trying to re-update the tuples it's already emitted, should it chance to hit them during its table scan).

The command IDs aren't interesting anymore once the originating transaction is over, but I don't see a realistic way to recycle the space ...

>> I believe we do want to distinguish three states: live tuple, dead
>> tuple, and empty space. Otherwise there will be cases where you're
>> forced to move data immediately to collapse empty space, when there's
>> not a good reason to except that your representation can't cope.

> I don't understand this.

I thought more about this in the shower this morning, and realized the fundamental drawback of the scheme you are suggesting: it requires the line pointers and physical storage to be in the same order. (Or you could make it work in reverse order, by looking at the prior pointer instead of the next one to determine item size; that would actually work a little better. But in any case line pointer order and physical storage order are tied together.)

This is clearly a loser for index pages: most inserts would require a data shuffle. But it is also a loser for heap pages, and the reason is that on heap pages we cannot change a tuple's index (line pointer number) once it's been created. If we did, it'd invalidate CTID forward links, index entries, and heapscan cursor positions for open scans.
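The ordering constraint can be seen in a toy model. The following is a minimal Python sketch, not PostgreSQL code: `PAGE_END`, `item_sizes`, and the offsets are hypothetical names and values chosen for illustration. It assumes a pointers-only layout in which each item's size is inferred from the offset held by the next line pointer.

```python
# Hypothetical model (not PostgreSQL source): line pointers hold only an
# offset, and an item's size is inferred from the next pointer's offset.

PAGE_END = 100  # assumed end of the tuple storage area


def item_sizes(offsets):
    """Infer each item's size as the distance to the next pointer's offset."""
    sizes = []
    for i, start in enumerate(offsets):
        end = offsets[i + 1] if i + 1 < len(offsets) else PAGE_END
        sizes.append(end - start)
    return sizes


# Line pointers in the same order as physical storage: sizes are sensible.
print(item_sizes([40, 60, 70]))  # [20, 10, 30]

# Same three items, but pointers 1 and 2 no longer follow storage order:
# the inferred sizes are wrong, and one is even negative.
print(item_sizes([40, 70, 60]))  # [30, -10, 40]
```

As soon as pointer order and storage order diverge, the inferred sizes stop meaning anything, so in this model keeping a line pointer number fixed forces the data itself to move.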
Indeed, pretty much the whole point of having the line pointers is to provide a stable ID for a tuple --- if we didn't need that we could just walk through the physical storage.

When VACUUM removes a dead tuple, it compacts out the physical space and marks the line pointer as unused. (Of course, it makes sure all references to the tuple are gone first.) The next time we want to insert a tuple on that page, we can recycle the unused line pointer instead of allocating a new one from the end of the line pointer array. However, the physical space for the new tuple should come from the main free-space pool in the middle of the page. To implement the pointers-without-sizes representation, we'd be forced to shuffle data to make room for the tuple between the two adjacent-by-line-number tuples.

The three states of a line pointer that I referred to are live (pointing at a good tuple), dead (pointing at storage that used to contain a good tuple, doesn't anymore, but hasn't been compacted out yet), and empty (doesn't point at storage at all; the space it used to describe has been merged into the middle-of-the-page free pool). ISTM a pointers-only representation can handle the live and dead cases nicely, but the empty case is going to be a real headache.

In short, a pointers-only representation would give us a lot less flexibility in free space management. It's an interesting idea but I doubt that saving two bytes per row is worth the extra overhead.

			regards, tom lane
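The recycling behaviour described above can be sketched with line pointers that carry both an offset and a length. This is a simplified, hypothetical Python model, not the actual page code: the `Page` and `LinePointer` classes, the 100-byte page, and the choice to carve tuples from the end of the free pool are all assumptions, and the space taken by the line pointer array itself is not modelled.

```python
from dataclasses import dataclass, field

LIVE, DEAD, UNUSED = "live", "dead", "unused"  # the three line pointer states


@dataclass
class LinePointer:
    state: str = UNUSED
    offset: int = 0
    length: int = 0


@dataclass
class Page:
    size: int = 100
    pointers: list = field(default_factory=list)
    free_end: int = 100  # upper edge of the free pool (assumed layout)

    def insert(self, length):
        """Take space from the free pool; reuse an unused pointer if any."""
        self.free_end -= length
        for n, lp in enumerate(self.pointers):
            if lp.state == UNUSED:
                break
        else:  # no unused pointer found: extend the line pointer array
            self.pointers.append(LinePointer())
            n = len(self.pointers) - 1
        self.pointers[n] = LinePointer(LIVE, self.free_end, length)
        return n

    def vacuum(self):
        """Repack live tuples and mark every other pointer unused."""
        self.free_end = self.size
        for lp in sorted(self.pointers, key=lambda p: -p.offset):
            if lp.state == LIVE:
                self.free_end -= lp.length
                lp.offset = self.free_end  # data moves, pointer number stays
            else:
                lp.state, lp.offset, lp.length = UNUSED, 0, 0


page = Page()
a, b, c = page.insert(20), page.insert(10), page.insert(30)
page.pointers[b].state = DEAD   # tuple b is deleted and no longer referenced
page.vacuum()
d = page.insert(25)             # recycles line pointer b
print(d, [(lp.offset, lp.length) for lp in page.pointers])
# 1 [(80, 20), (25, 25), (50, 30)]
```

After the final insert, line pointer 1 refers to storage that sits physically below the tuples of pointers 0 and 2. Because each pointer records its own length, that is harmless here; a pointers-only layout would instead have to move data so that the new tuple sat between its neighbours by line number.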