Re: [HACKERS] UNDO and in-place update - Mailing list pgsql-hackers
From: Amit Kapila
Subject: Re: [HACKERS] UNDO and in-place update
Msg-id: CAA4eK1LgJE+Z9y7Far3Ngponf4nF0M-Ch0zS2xyyRhyNhUtJJw@mail.gmail.com
In response to: Re: [HACKERS] UNDO and in-place update (Amit Kapila <amit.kapila16@gmail.com>)
Responses: Re: [HACKERS] UNDO and in-place update
List: pgsql-hackers
On Tue, Jan 10, 2017 at 7:28 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Jan 9, 2017 at 11:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> Yes, something like this can be done.  You don't really need any new
>> page-level header data, because you can get the XIDs from the TPD
>> entry (or from the page itself if there's only one).  But you could
>> expand the single "is-modified" bit that I've proposed adding to each
>> tuple to multiple bits.  0 means not recently modified.  1 means
>> modified by the first or only transaction that has recently modified
>> the page.  2 means modified by the second transaction that has
>> recently modified the page.  Etc.
>>
>
> makes sense.
>
>> What I was thinking about doing instead is storing an array in the TPD
>> containing the same information.  There would be one byte or one half
>> a byte or whatever per TID and it would contain the index of the XID
>> in the TPD that had most recently modified or locked that TID.  Your
>> solution might be better, though, at least for cases where the number
>> of tuples that have modified the page is small.
>>
>
> I think we also need to prevent multiple backends trying to reserve a
> slot in this array, which can be a point of contention.  Another point
> is that during pruning, if TIDs are changed due to row movement, we
> need to keep this array in sync.
>

Moving further, I have thought about consistent reads and about different
formats for storing undo, with the pros and cons of each.

Consistent Reads
----------------------------
If the page is modified by any transaction that is not visible to the
read transaction's snapshot, we have to reconstruct a copy of the page.
There could be multiple ways to achieve it:

1. Apply the undo for the transactions that are not visible to the read
transaction and keep the resulting copy of the page.
2. Apply the undo for all the transactions in the TPD entry of the page.

I think we can't do the second, because applying the undo of transactions
that *are* visible to the read transaction would create rows that are not
consistent.  The first approach will work, but it might turn out to be
inefficient if not implemented carefully: each read transaction may need
its own copy of the page, since a different set of transactions can be
invisible to each read transaction.  Even for read transactions to which
the same set of transactions is invisible, we need some way to recognize
which version of the page can be reused, probably by trying each existing
version and checking whether it can serve the purpose.  I think we also
need some way to link the different versions of a page in shared buffers,
so that before creating a new version of the page we can look at the
linked versions.

Yet another consideration in this area could be to perform consistent
reads differently for index scans.  For an index scan, we will receive a
particular TID to read, so in this case we can try to reconstruct the old
image of that particular row.  While this sounds efficient, we might need
to reconstruct the old image of a row multiple times for different read
transactions, and the effect will be magnified when we have to do this
for multiple rows of a page.  I think it might be better to always
construct a complete copy of the page if one or more transactions that
have changed the page are not visible to the read snapshot.
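To make the first approach concrete, here is a minimal sketch in C.  It
assumes hypothetical types and helpers (a TPDEntry carrying, per recent
XID, a pointer to that transaction's latest undo record for the page,
plus UndoRecPtr, XidVisibleInSnapshot, fetch_undo_record, and
apply_undo_to_page) -- none of these exist today; the sketch only
illustrates walking the per-page prior-undo chain for each transaction
the snapshot cannot see:

    /* Hypothetical sketch of approach 1; all names are illustrative. */
    Page
    reconstruct_page_for_snapshot(Page page, TPDEntry *tpd, Snapshot snapshot)
    {
        Page        copy = (Page) palloc(BLCKSZ);
        int         i;

        memcpy(copy, page, BLCKSZ);

        for (i = 0; i < tpd->num_xids; i++)
        {
            UndoRecPtr  urp;

            /* Undo only the transactions the snapshot cannot see. */
            if (XidVisibleInSnapshot(tpd->xids[i], snapshot))
                continue;

            /* Walk this transaction's prior-undo chain for this page. */
            for (urp = tpd->latest_undo[i]; UndoRecPtrIsValid(urp);)
            {
                UndoRecord *urec = fetch_undo_record(urp);

                apply_undo_to_page(copy, urec);
                urp = urec->prev_undo_for_page;
            }
        }
        return copy;
    }

Each distinct invisible set would get its own reconstructed copy, which
is why linking the versions in shared buffers matters.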
Undo Page Format
-------------------------------
I think we can organize undo in multiple ways:

(a) One undo file per transaction - We can name each undo file as
epoch+xid or epoch+xid.undo.  These files can be kept in a pg_undo
directory under PGDATA.  In this approach, we have to store a latest_undo
offset in the file header to perform the rollback operation; the same
offset can be used to allocate the location for a new undo record.  Apart
from that, undo entries can be organized one after another.  Each undo
entry will have two back pointers: one to reach the previous undo
location, which will be used for rollback, and a second to maintain the
prior-undo chain of changes to a particular page, which will be used to
reconstruct the copy of the page.  During recovery, we need to read each
undo file and check the transaction status in 'clog'; if the transaction
is not committed, we apply all its undo starting from the latest_undo
position.

I think we should organize undo in the form of pages with a size similar
to heap pages.  This will be helpful both for writing WAL and for doing
checkpoints for undo, where we expect pages.  Access to undo can be via
shared buffers.  We might want to keep shared buffers containing undo
separate from heap/index pages so that they can't be evicted based on
the access pattern of heap pages.  Different TPD entries containing the
same XID will point to different locations in undo; an undo location
will be just a page offset for a particular page entry.

Vacuum can delete an undo file, or make it reusable, when the
corresponding transaction precedes RecentGlobalXmin.  To decide which
files can be removed/reused, Vacuum either needs to traverse the TPD
entries or needs to find candidates by traversing the files in the
pg_undo directory.  Along with the removal of the undo file, we also
need some way to indicate that the TPD entry/entries for that undo are
invalid.  A simple way could be to treat a TPD entry as invalid if the
corresponding undo file doesn't exist; however, that might not work if
we choose to store the undo of multiple transactions in one undo file.

The main advantage of this approach is that it allows the unneeded undo
space to be removed efficiently.  The disadvantage is that creating a
separate file for each transaction can be costly, and it can lead to a
lot of undo files when there are long-running transactions.  Of course,
we can optimize by allowing the files to be removed in such cases and
giving the user the error "snapshot_too_old", but I think it can still
be a matter of concern.  To alleviate this concern, we can try to keep
the undo data of multiple transactions in one file.

(b) Multiple transactions in one undo file - Here we have to choose some
arbitrary names for the undo files, say undo_1, undo_2, etc. for the
sake of discussion.  These files can be kept in a pg_undo directory
under PGDATA.  In this approach, we can allocate a set of four or eight
8KB pages (pages of the same size as heap pages) per transaction; we can
call each such set of pages an undo segment.  If one segment is not
sufficient to store the undo for a transaction, we can allocate the next
segment.  Each segment can have a segment header in its first page
containing the previous segment's address.  For the very first segment,
the previous segment address will be Nil; the second segment will
contain the address of the first, the third the address of the second,
and so on.  The previous segment address is needed to replay undo for
rollback.  I have explicitly avoided storing the next segment address,
as we don't need it.  A sketch of the record and segment headers follows.
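To make the on-disk structures concrete, here is a rough C sketch of the
undo record header from (a) and the segment header from (b).  All type
and field names are assumptions for the sake of discussion, not an
actual format:

    /* Each undo record carries the two back pointers described in (a). */
    typedef struct UndoRecordHeader
    {
        uint32      prev_undo_for_xact; /* previous undo record of the same
                                         * transaction; followed during
                                         * rollback */
        uint32      prev_undo_for_page; /* previous undo record that changed
                                         * the same heap page; followed when
                                         * reconstructing an old page copy */
        /* operation-specific undo payload follows */
    } UndoRecordHeader;

    /* First page of each segment in (b) carries a segment header. */
    typedef struct UndoSegmentHeader
    {
        uint32      prev_segment;   /* segment number of this transaction's
                                     * previous segment; a Nil value for
                                     * the first segment */
    } UndoSegmentHeader;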
We also need some form of freespace map to keep track of undo segments
that are freed.  Once vacuum decides to free the undo of a particular
transaction, it can add all the undo segments of that transaction to the
freespace map.  To allocate a new segment for an existing or a new
transaction, we first search the freespace map for an available segment
and, only if none is available, allocate a new one.

In each file, the first segment will contain slots to hold the
latest_undo location and the XID (probably epoch + XID) of each
transaction that has undo in the file.  We need the latest_undo location
to roll back the transaction, and the XID is required to roll back the
transaction at the time of recovery.  Let's call this first segment the
HEAD segment.  We can extend the HEAD segment if there is more space in
the file, but for simplicity of discussion, let's assume one undo file
holds only the transactions that can get a slot in the HEAD segment.
Each slot can be 6 or 8 bytes long; with a 32KB segment and 8-byte
slots, we can hold the undo of ~4096 transactions in one file.  We can
keep some book-keeping information to indicate whether there is any
empty slot in the first segment of the file.  So whenever a new
transaction has to allocate undo, it checks the HEAD segment of undo_1;
if there is space to allocate a new slot, it allocates there, else it
tries the next file, undo_2.
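Here is an illustrative C sketch of the HEAD-segment slot directory and
the free-slot search.  The slot contents and sizes are assumptions (an
8-byte slot holding a 4-byte XID plus a 4-byte latest_undo offset, with
the epoch tracked elsewhere), which is what yields 32768 / 8 = ~4096
slots per 32KB HEAD segment:

    /* Hypothetical 8-byte slot; a full epoch + XID would enlarge it. */
    typedef struct UndoSlot
    {
        TransactionId xid;          /* owning transaction;
                                     * InvalidTransactionId marks a free
                                     * slot */
        uint32      latest_undo;    /* offset of this transaction's most
                                     * recent undo record, for rollback
                                     * and recovery */
    } UndoSlot;

    #define SLOTS_PER_HEAD_SEGMENT  ((int) (32768 / sizeof(UndoSlot)))

    /*
     * Find a free slot in this file's HEAD segment; returns -1 if the
     * segment is full, in which case the caller tries the next undo
     * file (undo_2, undo_3, ...).
     */
    static int
    find_free_undo_slot(UndoSlot *slots)
    {
        int         i;

        for (i = 0; i < SLOTS_PER_HEAD_SEGMENT; i++)
        {
            if (!TransactionIdIsValid(slots[i].xid))
                return i;
        }
        return -1;
    }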
We do need to consider what to do when the undo file chosen for a
transaction fills up while the transaction still has more undo to write.
There are a couple of options: we can first try to free the space of any
existing transaction that has already committed, and if no more
freespace is available, we can either wait or allocate an additional XID
to write the undo, as proposed in the initial e-mail in this thread.

As in the previous approach, undo can be accessed via shared buffers,
and each undo entry will have the two back pointers described above.
There are some loose ends, such as how the freespace map will be
organized, which need more thought, but I think the core of the idea can
be made to work.  This overcomes the disadvantages of the previous
approach: (a) there could be too many undo files and (b) the cost
associated with creating and managing that many files.  It might be
slightly less efficient than the previous approach at purging unused
undo space, as it can't return the freed space in an undo file back to
the OS until all the transactions in the file precede RecentGlobalXmin.
Another disadvantage compared to the previous approach is that there can
be some contention in finding a free undo slot in an undo file, but I
think the advantage of having fewer files to manage will outweigh it.

This still doesn't cover the UNDO record contents for each operation.  I
think we can design that once the file format is fixed, although there
is no strict dependency between the two.

Thanks to Dilip Kumar for offlist discussions on this topic that helped
figure out some of the design points.

Thoughts?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com