Thread: postgres inode q's
I apologize if this is the wrong group for this message, but I'm not sure where else it would go. I don't have a specific problem, but I would like to ask some questions about how postgres works.

But first, some background info: I have two identical servers, each running postgres 6.5.1, and each has an identical database called zipfind. This is a pretty static, mostly read-only database with 700,000 rows. A few days ago I got some updated information for the database, 1,400,000 rows' worth (almost double the data), in ASCII format. I got the new rows inserted with a perl script which read the ASCII file line by line and inserted the data. This took quite a while; in fact, it took more than 24 hours.

So I decided I would update the second database in a different way. I realized I could pg_dump the new zipfind database and read it back in using psql on the other machine, but I decided to try it a little differently, just to see what would happen. What I tried was to move the actual data files in the data/base/zipfind directory from the newly updated database directly to the machine still in need of updating. I shut down postmaster on the machine that I was moving the files to, replaced all of the files in the zipfind directory with the files from the machine with all the new rows, reset all the permissions, and restarted postmaster.

The strange thing is, even though the old files were removed and replaced with new files using identical file names, psql seemed to be reading data from the old database as if it had not been removed. Issuing a "select count(*) from zips;" returned the old row count, 666730, instead of the new row count of ca. 1,400,000. If anything, I expected to get some kind of error, not the old row count!

I checked the file sizes in the zipfind directory to make sure I hadn't made a mistake while putting the new data in place. Everything was correct. I then vacuumed the database and rechecked the file sizes: the "zips" table entry now reported the old file size!

It occurred to me that there may be some system tables which were causing the erratic behaviour; I searched for something relevant but found nothing. The only theory that I could come up with was that postgres latched on to an inode for the original files, but how it would keep that inode info across daemon invocations seems a mystery to me.

Explanations appreciated!

Thanks,
Bryan
Bryan Ingram wrote:

> I apologize if this is the wrong group for this message, but I'm not
> sure where else it would go.

Topshot - absolutely the right group.

> I don't have a specific problem, but I would like to ask some questions
> about how postgres works.
>
> But first, some background info:
> I have two identical servers, each running postgres 6.5.1, and each has
> an identical database called zipfind. This is a pretty static, mostly read
> ...
> It occurred to me that there may be some system tables which were
> causing the erratic behaviour; I searched for something relevant but
> found nothing.

Warm, warm, hot - missed!

> The only theory that I could come up with was that postgres latched on
> to an inode for the original files, but how it would keep that inode
> info across daemon invocations seems a mystery to me.

Deep frozen :-)

I assume from this description that one of the servers was created more or less by a similar copy operation, but that time it was the entire ./data directory that got copied, or maybe the entire installation directory - right? If not, the two installations must have been treated absolutely identically until all the data was inserted into the zipfind databases.

Anyway, the system file causing this is pg_log. It's not a table; it's a bitmap telling which transactions have committed and which ones have not. There are some transaction ID fields in the header information of each data tuple in PostgreSQL. One tells in which transaction this tuple appeared, and the other in which it disappeared. But they are ignored if the transaction in question isn't marked as committed in pg_log. So on a DELETE operation, the deleted tuples simply get the DELETE's transaction ID stamped into the ending field, and on an UPDATE the same is done and a new tuple with this XID as its beginning is appended at the end of the table.

Can you imagine now what a ROLLBACK in PostgreSQL means? Simple, eh? Just mark the transaction in pg_log as rolled back and the stamps will get ignored. So the old tuple is still valid and the new tuple at the end is ignored.

VACUUM is the utility that (dramatically simplified) wipes out all the tuples with a committed XID in the ending field and truncates the data file.

Since you didn't copy pg_log (AND DON'T DO SO, IT WOULD CORRUPT ALL DATABASES IN THE INSTALLATION), from PostgreSQL's point of view all the UPDATEs/INSERTs found in the copied zipfind database files never committed, so they were ignored.

Either you copy the entire ./data directory, or you do it with pg_dump. No other chance.

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#========================================= wieck@debis.com (Jan Wieck) #
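[To make the pg_log mechanism concrete, here is a minimal sketch in C: two status bits per transaction ID, consulted to decide whether a tuple's insert stamp counts. The names and layout are illustrative assumptions, not the actual 6.5 on-disk format.]

    /* Minimal sketch (assumed layout): pg_log as a bitmap with two
     * status bits per transaction ID, four XIDs packed per byte.  */

    #include <stdint.h>

    typedef uint32_t TransactionId;

    enum XidStatus {
        XID_IN_PROGRESS = 0,        /* no outcome recorded yet */
        XID_COMMITTED   = 1,        /* transaction committed   */
        XID_ABORTED     = 2         /* transaction rolled back */
    };

    static enum XidStatus
    xid_get_status(const uint8_t *pg_log, TransactionId xid)
    {
        /* Each byte holds four 2-bit entries; pick the right pair. */
        uint8_t byte = pg_log[xid / 4];
        return (enum XidStatus) ((byte >> ((xid % 4) * 2)) & 0x3);
    }

    /* A tuple's insert stamp only counts if its transaction is marked
     * committed in pg_log.  Copy the data files without pg_log, and
     * every new tuple's XID reads as "never committed" - which is
     * exactly what happened to the copied zipfind files.            */
    static int
    insert_counts(const uint8_t *pg_log, TransactionId xmin)
    {
        return xid_get_status(pg_log, xmin) == XID_COMMITTED;
    }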
Jan,

Thanks for the explanation. That does help, and it adds a lot to my postgres knowledge in general.

Based on your explanation, I understand how running VACUUM wiped out the new tuples that did not have a corresponding XID in pg_log. However, there is one aspect of this I still do not quite grasp: what happens if the INSERT/DELETE is done without a transaction (BEGIN/COMMIT)? Is an XID still generated for that particular tuple, or is the tuple instantly committed with no XID stamped into the beginning/ending fields?

Also, I don't understand why vacuum didn't wipe out all tuples in the database, rather than just the new ones. Here's why: when I updated the "new" database with the new records, I used the DELETE-then-INSERT trick to avoid having to write logic to first see if there was an existing record and then update only the changing fields. Since I actually deleted, then inserted, I'm guessing that the XIDs would change, so that when I moved the database over to the other server, ALL of the XIDs would be different, not just those of the newly added rows. In which case, I would expect VACUUM to wipe everything. Instead, it only wiped the new rows, which tells me that even though I DELETED/INSERTED all existing rows, the XIDs somehow still sync with the XIDs on the other server.

Assuming the XIDs did change, my guess is that although I ended up with exactly the number of rows I started with (666730 instead of 1,400,000+), that is only because the XIDs happened to correspond, though not necessarily with their original relationships. Which would mean I still had 666730 rows, but not the original ones - probably a smattering of new and old.

I'm just theorizing off the top of my head; please let me know where I have gone wrong!

Much thanks,
Bryan
Bryan Ingram <bingram@cpsgroup.com> writes:

> What happens if the INSERT/DELETE is done without a transaction
> (BEGIN/COMMIT)? Is an XID still generated for that particular tuple,
> or is the tuple instantly committed with no XID stamped into the
> beginning/ending fields?

There is always a transaction. Postgres effectively generates an implicit BEGIN and END around any query that's not inside an explicit transaction block. This is why failing statements don't cause trouble; their transactions get aborted.

> When I updated the "new" database with the new records, I used the
> DELETE-then-INSERT trick to avoid having to write logic to first see
> if there was an existing record and then update only the changing
> fields. Since I actually deleted, then inserted, I'm guessing that the
> XIDs would change, so that when I moved the database over to the other
> server, ALL of the XIDs would be different, not just those of the newly
> added rows. In which case, I would expect VACUUM to wipe everything.
> Instead, it only wiped the new rows, which tells me that even though I
> DELETED/INSERTED all existing rows, the XIDs somehow still sync with
> the XIDs on the other server.

Yeah, but the old tuples are *still there*. They are marked as having been deleted by transaction XID so-and-so. When you moved the files, those transaction numbers are no longer thought to be committed, so the old tuples come back to life (just as the new tuples are no longer considered valid, because their inserting transaction is not known to be committed).

There is a potential hole in this theory, which relates to a point Jan didn't make in his otherwise excellent discussion. A tuple normally doesn't stay marked with its creating or deleting XID number for all that long, because we don't really want to pay the overhead of consulting pg_log for every single tuple. So, as soon as any backend checks a tuple and sees that its inserting transaction did commit, it rewrites the tuple with a new state "INSERT KNOWN COMMITTED" (which is represented by inserting XID = 0 or some such). After that, no one has to check pg_log anymore for that tuple; it's good. Similarly, the deleting XID only stays on the tuple until someone verifies that the deleting transaction committed; after that the tuple is marked KNOWN DEAD, and it'll stay dead no matter what's in pg_log. VACUUM is really only special in that it reclaims space occupied by known-dead tuples; when it checks/updates the state of a tuple, it's not doing anything that's not done by a plain SELECT.

So, AFAICT, you could only have seen the problem for tuples that were not scanned by any SELECT or UPDATE operation subsequent to having been inserted/deleted and committed. If you did all the deletes/inserts inside a transaction, committed, and then immediately copied the files, then for sure you'd have gotten burnt. If you did any sort of SELECT from the table after committing the changes, I'd have expected the tuple states to get frozen --- at least for the tuples that SELECT visited, which might not have been all of them if the SELECT was able to use an index.

			regards, tom lane
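[To make the "KNOWN COMMITTED" caching concrete, here is a hedged sketch of the idea, loosely modeled on what tqual.c does. The flag names mirror the backend's t_infomask hint bits, but the values, the struct, and xid_committed() are assumptions made for illustration; a real backend also handles still-in-progress transactions, which this sketch ignores.]

    #include <stdint.h>

    typedef uint32_t TransactionId;

    /* Flag names follow the backend's t_infomask hint bits; the
     * values here are assumptions for the sake of the sketch.   */
    #define HEAP_XMIN_COMMITTED 0x0100  /* insert known committed */
    #define HEAP_XMIN_INVALID   0x0200  /* insert known aborted   */

    typedef struct {
        TransactionId t_xmin;           /* inserting transaction   */
        uint16_t      t_infomask;       /* cached status flags     */
    } TupleHeaderSketch;

    /* Assumed helper: the pg_log lookup from the earlier sketch.
     * Assumes the transaction's fate is already decided.        */
    extern int xid_committed(TransactionId xid);

    static int
    xmin_is_committed(TupleHeaderSketch *tup)
    {
        if (tup->t_infomask & HEAP_XMIN_COMMITTED)
            return 1;                   /* cached: skip pg_log    */
        if (tup->t_infomask & HEAP_XMIN_INVALID)
            return 0;                   /* cached: known aborted  */

        if (xid_committed(tup->t_xmin)) {
            /* First visitor freezes the verdict onto the tuple, so
             * no later reader consults pg_log for it again.       */
            tup->t_infomask |= HEAP_XMIN_COMMITTED;
            return 1;
        }
        tup->t_infomask |= HEAP_XMIN_INVALID;
        return 0;
    }

[Because the cached flags live in the tuple header itself, they travel with copied data files while pg_log does not; that is why only tuples never revisited after commit were at risk in Bryan's copy.]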
Tom Lane wrote:

> ................ So, as soon as any backend
> checks a tuple and sees that its inserting transaction did commit,
> it rewrites the tuple with a new state "INSERT KNOWN COMMITTED" (which
> is represented by inserting XID = 0 or some such). .........

The way concurrency is supported in PostgreSQL is really cool, and I think not widely understood. The tuple uses flags stored in the t_infomask field of the HeapTupleHeader structure to 'cache' the status of the creating and deleting transactions for each tuple. Check out backend/utils/time/tqual.c and include/utils/tqual.h for the details of the algorithms. (Not recommended if you have been drinking at all.)

Ullman's "Principles of Database and Knowledge-Base Systems, Vol. 1" has a pretty good discussion of time-based and lock-based schemes for concurrency control.

Bernie Frankpitt
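[For anyone who wants the gist without reading tqual.c: putting Jan's and Tom's posts together, the core time-qualification test boils down to something like this hedged sketch. The field names, the InvalidXid marker, and xid_committed() are illustrative assumptions; the real code handles many more cases (in-progress transactions, the hint-bit caching above, and so on).]

    #include <stdint.h>

    typedef uint32_t TransactionId;

    #define InvalidXid ((TransactionId) 0)  /* assumed "never deleted" marker */

    typedef struct {
        TransactionId xmin;     /* transaction that created the tuple  */
        TransactionId xmax;     /* transaction that deleted it, if any */
    } TupleStamps;

    /* Assumed pg_log lookup, as in the earlier sketches. */
    extern int xid_committed(TransactionId xid);

    /* Visible = insert committed AND (never deleted OR delete not
     * committed).  A rolled-back DELETE thus leaves the tuple alive,
     * and a rolled-back INSERT leaves it invisible, with no data
     * ever rewritten - only pg_log changes. */
    static int
    tuple_is_visible(const TupleStamps *t)
    {
        if (!xid_committed(t->xmin))
            return 0;
        if (t->xmax == InvalidXid)
            return 1;
        return !xid_committed(t->xmax);
    }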
Tom Lane wrote:

> Yeah, but the old tuples are *still there*. They are marked as having
> been deleted by transaction XID so-and-so. When you moved the files,
> those transaction numbers are no longer thought to be committed, so
> the old tuples come back to life (just as the new tuples are no longer
> considered valid, because their inserting transaction is not known to
> be committed).
>
> There is a potential hole in this theory, which relates to a point Jan
> didn't make in his otherwise excellent discussion. A tuple normally
> doesn't stay marked with its creating or deleting XID number for all
> that long, because we don't really want to pay the overhead of
> consulting pg_log for every single tuple. So, as soon as any backend
> checks a tuple and sees that its inserting transaction did commit,
> it rewrites the tuple with a new state "INSERT KNOWN COMMITTED" (which
> is represented by inserting XID = 0 or some such). After that, no one
> has to check pg_log anymore for that tuple; it's good. Similarly, the
> deleting XID only stays on the tuple until someone verifies that the
> deleting transaction committed; after that the tuple is marked KNOWN
> DEAD, and it'll stay dead no matter what's in pg_log. VACUUM is really
> only special in that it reclaims space occupied by known-dead tuples;
> when it checks/updates the state of a tuple, it's not doing anything
> that's not done by a plain SELECT.
>
> So, AFAICT, you could only have seen the problem for tuples that were
> not scanned by any SELECT or UPDATE operation subsequent to having been
> inserted/deleted and committed. If you did all the deletes/inserts
> inside a transaction, committed, and then immediately copied the files,
> then for sure you'd have gotten burnt. If you did any sort of SELECT
> from the table after committing the changes, I'd have expected the tuple
> states to get frozen --- at least for the tuples that SELECT visited,
> which might not have been all of them if the SELECT was able to use an
> index.

Sounds like good material for the manual... and the book.

--------
Regards
Theo