Thread: postgres inode q's
I apologize if this is the wrong group for this message, but I'm not sure where else it would go. I don't have a specific problem, but I would like to ask some questions about how postgres works.

But first, some background info: I have two identical servers, each running postgres 6.5.1, and each has an identical database called zipfind. This is a pretty static, mostly read-only database with 700,000 rows. A few days ago I got some updated information for the database, 1,400,000 rows' worth (almost double the data), in ASCII format. I got the new rows inserted with a perl script which read the ASCII file line by line and inserted the data. This took quite a while; in fact, it took more than 24 hours.

So I decided I would update the second database in a different way. I realized I could pg_dump the new zipfind database and read it back in using psql on the other machine, but I decided to try it a little differently, just to see what would happen. What I tried was to move the actual data files in the data/base/zipfind directory from the newly updated database directly to the machine still in need of updating. I shut down postmaster on the machine that I was moving the files to, replaced all of the files in the zipfind directory with the files from the machine with all the new rows, reset all the permissions, and restarted postmaster.

The strange thing is, even though the old files were removed and replaced with new files using identical file names, psql seemed to be reading data from the old database as if it had not been removed. Issuing a "select count(*) from zips;" returned the old row count, 666730, instead of the new row count of ca. 1,400,000. If anything, I expected to get some kind of error, not the old row count!

I checked the file sizes in the zipfind directory to make sure I hadn't made a mistake while putting the new data in place. Everything was correct. I then vacuumed the database and rechecked the file sizes: the "zips" table entry now reported the old file size!

It occurred to me that there may be some system tables which were causing the erratic behaviour; I searched for something relevant but found nothing. The only theory that I could come up with was that postgres latched on to an inode for the original files, but how it would keep that inode info across daemon invocations seems a mystery to me.

Explanations appreciated!

Thanks,
Bryan
Bryan Ingram wrote:

> I apologize if this is the wrong group for this message, but I'm not
> sure where else it would go.

Topshot - absolutely the right group.

> I don't have a specific problem, but I would like to ask some questions
> about how postgres works.
>
> But first, some background info:
> I have two identical servers, each running postgres 6.5.1, and each has
> an identical database called zipfind. This is a pretty static, mostly read
> ...
> It occurred to me that there may be some system tables which were
> causing the erratic behaviour; I searched for something relevant but
> found nothing.

Warm, warm, hot - missed!

> The only theory that I could come up with was that postgres latched on
> to an inode for the original files, but how it would keep that inode
> info across daemon invocations seems a mystery to me.

Deep frozen :-)

I assume from this description that one of the servers was created more or less by a similar copy operation, but that time it was the entire ./data directory that got copied, or maybe the entire installation directory - right? If not, the two installations must have been treated absolutely identically until all the data was inserted into the zipfind databases.

Anyway, the system file causing this is pg_log. It's not a table; it's a bitmap telling which transactions have committed and which ones have not. There are some transaction ID fields in the header information of each data tuple in PostgreSQL. One tells in which transaction this tuple appeared, and the other in which it disappeared. But they are ignored if the transaction in question isn't marked as committed in pg_log. So on a DELETE operation, the deleted tuples simply get the DELETE's transaction ID stamped into the ending field, and on an UPDATE the same is done and a new tuple with this XID as its beginning is appended at the end of the table.

Can you imagine now what a ROLLBACK in PostgreSQL means? Simple, eh? Just mark the transaction in pg_log as rolled back and the stamps will get ignored. So the old tuple is still valid and the new tuple at the end is ignored.

VACUUM is the utility that (dramatically simplified) wipes out all the tuples with a committed XID in the ending field and truncates the data file.

Since you didn't copy pg_log (AND DON'T DO SO, IT WOULD CORRUPT ALL DATABASES IN THE INSTALLATION), from PostgreSQL's point of view all the UPDATEs/INSERTs found in the copied zipfind database files never committed, so they were ignored.

Either you copy the entire ./data directory, or you do it with pg_dump. No other chance.

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#========================================= wieck@debis.com (Jan Wieck) #
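[To make the pg_log mechanism concrete, here is a minimal sketch in C: two status bits per transaction ID, consulted to decide whether a tuple's insert stamp counts. The names and layout are illustrative assumptions, not the actual 6.5 on-disk format.]

    /* Minimal sketch (assumed layout): pg_log as a bitmap with two
     * status bits per transaction ID, four XIDs packed per byte.  */

    #include <stdint.h>

    typedef uint32_t TransactionId;

    enum XidStatus {
        XID_IN_PROGRESS = 0,        /* no outcome recorded yet */
        XID_COMMITTED   = 1,        /* transaction committed   */
        XID_ABORTED     = 2         /* transaction rolled back */
    };

    static enum XidStatus
    xid_get_status(const uint8_t *pg_log, TransactionId xid)
    {
        /* Each byte holds four 2-bit entries; pick the right pair. */
        uint8_t byte = pg_log[xid / 4];
        return (enum XidStatus) ((byte >> ((xid % 4) * 2)) & 0x3);
    }

    /* A tuple's insert stamp only counts if its transaction is marked
     * committed in pg_log.  Copy the data files without pg_log, and
     * every new tuple's XID reads as "never committed" - which is
     * exactly what happened to the copied zipfind files.            */
    static int
    insert_counts(const uint8_t *pg_log, TransactionId xmin)
    {
        return xid_get_status(pg_log, xmin) == XID_COMMITTED;
    }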
Jan,

Thanks for the explanation. That does help, and it adds a lot to my postgres knowledge in general.

Based on your explanation, I understand how running VACUUM wiped out the new tuples that did not have a corresponding XID in pg_log. However, there is one aspect of this I still do not quite grasp: what happens if the INSERT/DELETE is done without a transaction (BEGIN/COMMIT)? Is an XID still generated for that particular tuple, or is the tuple instantly committed with no XID stamped into the beginning/ending fields?

Also, I don't understand why vacuum didn't wipe out all tuples in the database, rather than just the new ones. Here's why: when I updated the "new" database with the new records, I used the DELETE-then-INSERT trick to avoid having to write logic to first see if there was an existing record and then update only the changing fields. Since I actually deleted, then inserted, I'm guessing that the XIDs would change, so that when I moved the database over to the other server, ALL of the XIDs would be different, not just those of the newly added rows. In which case, I would expect VACUUM to wipe everything. Instead, it only wiped the new rows, which tells me that even though I DELETED/INSERTED all existing rows, the XIDs somehow still sync with the XIDs on the other server.

Assuming the XIDs did change, my guess is that although I ended up with exactly the number of rows I started with (666730 instead of 1,400,000+), that is only because the XIDs happened to correspond, though not necessarily with their original relationships. Which would mean I still had 666730 rows, but not the original ones - probably a smattering of new and old.

I'm just theorizing off the top of my head; please let me know where I have gone wrong!

Much thanks,
Bryan
Bryan Ingram <bingram@cpsgroup.com> writes:

> What happens if the INSERT/DELETE is done without a transaction
> (BEGIN/COMMIT)? Is an XID still generated for that particular tuple,
> or is the tuple instantly committed with no XID stamped into the
> beginning/ending fields?

There is always a transaction. Postgres effectively generates an implicit BEGIN and END around any query that's not inside an explicit transaction block. This is why failing statements don't cause trouble; their transactions get aborted.

> When I updated the "new" database with the new records, I used the
> DELETE-then-INSERT trick to avoid having to write logic to first see
> if there was an existing record and then update only the changing
> fields. Since I actually deleted, then inserted, I'm guessing that the
> XIDs would change, so that when I moved the database over to the other
> server, ALL of the XIDs would be different, not just those of the newly
> added rows. In which case, I would expect VACUUM to wipe everything.
> Instead, it only wiped the new rows, which tells me that even though I
> DELETED/INSERTED all existing rows, the XIDs somehow still sync with
> the XIDs on the other server.

Yeah, but the old tuples are *still there*. They are marked as having been deleted by transaction XID so-and-so. When you moved the files, those transaction numbers are no longer thought to be committed, so the old tuples come back to life (just as the new tuples are no longer considered valid, because their inserting transaction is not known to be committed).

There is a potential hole in this theory, which relates to a point Jan didn't make in his otherwise excellent discussion. A tuple normally doesn't stay marked with its creating or deleting XID number for all that long, because we don't really want to pay the overhead of consulting pg_log for every single tuple. So, as soon as any backend checks a tuple and sees that its inserting transaction did commit, it rewrites the tuple with a new state "INSERT KNOWN COMMITTED" (which is represented by inserting XID = 0 or some such). After that, no one has to check pg_log anymore for that tuple; it's good. Similarly, the deleting XID only stays on the tuple until someone verifies that the deleting transaction committed; after that the tuple is marked KNOWN DEAD, and it'll stay dead no matter what's in pg_log. VACUUM is really only special in that it reclaims space occupied by known-dead tuples; when it checks/updates the state of a tuple, it's not doing anything that's not done by a plain SELECT.

So, AFAICT, you could only have seen the problem for tuples that were not scanned by any SELECT or UPDATE operation subsequent to having been inserted/deleted and committed. If you did all the deletes/inserts inside a transaction, committed, and then immediately copied the files, then for sure you'd have gotten burnt. If you did any sort of SELECT from the table after committing the changes, I'd have expected the tuple states to get frozen --- at least for the tuples that SELECT visited, which might not have been all of them if the SELECT was able to use an index.

			regards, tom lane
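[To make the "KNOWN COMMITTED" caching concrete, here is a hedged sketch of the idea, loosely modeled on what tqual.c does. The flag names mirror the backend's t_infomask hint bits, but the values, the struct, and xid_committed() are assumptions made for illustration; a real backend also handles still-in-progress transactions, which this sketch ignores.]

    #include <stdint.h>

    typedef uint32_t TransactionId;

    /* Flag names follow the backend's t_infomask hint bits; the
     * values here are assumptions for the sake of the sketch.   */
    #define HEAP_XMIN_COMMITTED 0x0100  /* insert known committed */
    #define HEAP_XMIN_INVALID   0x0200  /* insert known aborted   */

    typedef struct {
        TransactionId t_xmin;           /* inserting transaction   */
        uint16_t      t_infomask;       /* cached status flags     */
    } TupleHeaderSketch;

    /* Assumed helper: the pg_log lookup from the earlier sketch.
     * Assumes the transaction's fate is already decided.        */
    extern int xid_committed(TransactionId xid);

    static int
    xmin_is_committed(TupleHeaderSketch *tup)
    {
        if (tup->t_infomask & HEAP_XMIN_COMMITTED)
            return 1;                   /* cached: skip pg_log    */
        if (tup->t_infomask & HEAP_XMIN_INVALID)
            return 0;                   /* cached: known aborted  */

        if (xid_committed(tup->t_xmin)) {
            /* First visitor freezes the verdict onto the tuple, so
             * no later reader consults pg_log for it again.       */
            tup->t_infomask |= HEAP_XMIN_COMMITTED;
            return 1;
        }
        tup->t_infomask |= HEAP_XMIN_INVALID;
        return 0;
    }

[Because the cached flags live in the tuple header itself, they travel with copied data files while pg_log does not; that is why only tuples never revisited after commit were at risk in Bryan's copy.]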
Tom Lane wrote:

> ................ So, as soon as any backend
> checks a tuple and sees that its inserting transaction did commit,
> it rewrites the tuple with a new state "INSERT KNOWN COMMITTED" (which
> is represented by inserting XID = 0 or some such). .........

The way concurrency is supported in PostgreSQL is really cool, and I think not widely understood. The tuple uses flags stored in the t_infomask field of the HeapTupleHeader structure to 'cache' the status of the creating and deleting transactions for each tuple. Check out backend/utils/time/tqual.c and include/utils/tqual.h for the details of the algorithms. (Not recommended if you have been drinking at all.)

Ullman's "Principles of Database and Knowledge-Base Systems, Vol. 1" has a pretty good discussion of time-based and lock-based schemes for concurrency control.

Bernie Frankpitt
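[For anyone who wants the gist without reading tqual.c: putting Jan's and Tom's posts together, the core time-qualification test boils down to something like this hedged sketch. The field names, the InvalidXid marker, and xid_committed() are illustrative assumptions; the real code handles many more cases (in-progress transactions, the hint-bit caching above, and so on).]

    #include <stdint.h>

    typedef uint32_t TransactionId;

    #define InvalidXid ((TransactionId) 0)  /* assumed "never deleted" marker */

    typedef struct {
        TransactionId xmin;     /* transaction that created the tuple  */
        TransactionId xmax;     /* transaction that deleted it, if any */
    } TupleStamps;

    /* Assumed pg_log lookup, as in the earlier sketches. */
    extern int xid_committed(TransactionId xid);

    /* Visible = insert committed AND (never deleted OR delete not
     * committed).  A rolled-back DELETE thus leaves the tuple alive,
     * and a rolled-back INSERT leaves it invisible, with no data
     * ever rewritten - only pg_log changes. */
    static int
    tuple_is_visible(const TupleStamps *t)
    {
        if (!xid_committed(t->xmin))
            return 0;
        if (t->xmax == InvalidXid)
            return 1;
        return !xid_committed(t->xmax);
    }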
Tom Lane wrote:

> Yeah, but the old tuples are *still there*. They are marked as having
> been deleted by transaction XID so-and-so. When you moved the files,
> those transaction numbers are no longer thought to be committed, so
> the old tuples come back to life (just as the new tuples are no longer
> considered valid, because their inserting transaction is not known to
> be committed).
>
> There is a potential hole in this theory, which relates to a point Jan
> didn't make in his otherwise excellent discussion. A tuple normally
> doesn't stay marked with its creating or deleting XID number for all
> that long, because we don't really want to pay the overhead of
> consulting pg_log for every single tuple. So, as soon as any backend
> checks a tuple and sees that its inserting transaction did commit,
> it rewrites the tuple with a new state "INSERT KNOWN COMMITTED" (which
> is represented by inserting XID = 0 or some such). After that, no one
> has to check pg_log anymore for that tuple; it's good. Similarly, the
> deleting XID only stays on the tuple until someone verifies that the
> deleting transaction committed; after that the tuple is marked KNOWN
> DEAD, and it'll stay dead no matter what's in pg_log. VACUUM is really
> only special in that it reclaims space occupied by known-dead tuples;
> when it checks/updates the state of a tuple, it's not doing anything
> that's not done by a plain SELECT.
>
> So, AFAICT, you could only have seen the problem for tuples that were
> not scanned by any SELECT or UPDATE operation subsequent to having been
> inserted/deleted and committed. If you did all the deletes/inserts
> inside a transaction, committed, and then immediately copied the files,
> then for sure you'd have gotten burnt. If you did any sort of SELECT
> from the table after committing the changes, I'd have expected the tuple
> states to get frozen --- at least for the tuples that SELECT visited,
> which might not have been all of them if the SELECT was able to use an
> index.

Sounds like good material for the manual... and the book.

--------
Regards
Theo