Thread: Large object loading stalls
Hi,

We have been running into a strange problem with the loading of large objects on the production servers of a system we are about to roll out. At various times (usually separated by a few days), the loading process against the server will suddenly either take an extremely long time (going from the usual 15-20 minutes up to 45 minutes or more) or simply hang altogether (i.e., after 8+ hours have gone by, we haven't had the patience to wait and see if it eventually finishes). This is on an 8.3.3 server running on Debian Etch; most of the previous testing occurred on 8.1.10, where we did not encounter this problem (though we had other issues on that platform).

I have run some tests loading large objects into a PostgreSQL database (8.3.3) with a more rapid turnaround than in production (loading, deleting, vacuuming directly and then starting the loading over again) and have been able to provoke what seems to be the same issue there - usually within 12 hours.

Essentially, we have a C++ program (pqxx) loading a large number of large objects (somewhere around 80,000 or so) that exhibits strange behaviour. It varies quite a lot in loading time (from 1600 to 2500 seconds), and at random times it will simply hang (at least it seems so, given that it has not progressed at all after 7-8 hours). It displays this behaviour irrespective of whether we load all 80,000 fields in one go or break the loading up into smaller consecutive batches of 1,000 blobs at a time (though it does seem easier to provoke the problem with the former approach). It can run fine for hours at a time and then suddenly lock up (in the latest test I did, it loaded the data correctly 22 times in a row before locking up on the 23rd run-through).

The pattern seems to be identical each time; ps aux shows the process (wdb) idle in transaction, with no activity on the system being apparent:

postgres 19965  0.0  0.0  72628  5584 ?  S   Feb17  0:01 /usr/lib/postgresql/8.3/bin/postgres -D /var/lib/postgresql/8.3/main -c config_file=/etc/postgresql/8.3/main/postgresql.conf
postgres 19967  0.4  0.0  72748 27636 ?  Ss  Feb17  6:38 postgres: writer process
postgres 19968  0.0  0.0  72628  1784 ?  Ss  Feb17  0:09 postgres: wal writer process
postgres 19969  0.0  0.0  72940  1924 ?  Ss  Feb17  0:00 postgres: autovacuum launcher process
postgres 19970  0.0  0.0  42352  1736 ?  Ss  Feb17  0:02 postgres: stats collector process
postgres 16617  0.0  0.0  95156 10948 ?  Ss  08:42  0:00 postgres: autovacuum worker process   wdb
postgres 18834  0.0  0.1 118632 37320 ?  Ss  09:06  0:01 postgres: wdb wdb [local] idle in transaction

A core dump we took of the program earlier seemed to indicate that it was waiting on the database to return from the insert statement. top indicates that both postgres and the application are idling (no load on the server whatsoever). pg_locks and pg_stat_activity seem to indicate that there may perhaps be some issue with autovacuum?
pg_locks:

  locktype  | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction |  pid  |           mode           | granted
------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-------+--------------------------+---------
 relation   |    16385 |     2683 |      |       |            |               |         |       |          | 2/1122117          | 16617 | RowExclusiveLock         | t
 relation   |    16385 |    10969 |      |       |            |               |         |       |          | 3/1105             | 27895 | AccessShareLock          | t
 relation   |    16385 |     2613 |      |       |            |               |         |       |          | 2/1122117          | 16617 | ShareUpdateExclusiveLock | t
 virtualxid |          |          |      |       | 3/1105     |               |         |       |          | 3/1105             | 27895 | ExclusiveLock            | t
 virtualxid |          |          |      |       | 2/1122117  |               |         |       |          | 2/1122117          | 16617 | ExclusiveLock            | t
(5 rows)

pg_stat_activity:

 datid | datname | procpid | usesysid | usename  |                     current_query                     | waiting |          xact_start           |          query_start          |         backend_start         | client_addr | client_port
-------+---------+---------+----------+----------+-------------------------------------------------------+---------+-------------------------------+-------------------------------+-------------------------------+-------------+-------------
 16385 | wdb     |   16617 |       10 | postgres | autovacuum: VACUUM ANALYZE pg_catalog.pg_largeobject  | f       | 2009-02-18 08:42:45.066922+00 | 2009-02-18 08:42:45.066922+00 | 2009-02-18 08:42:45.065857+00 |             |
 16385 | wdb     |   27895 |    16384 | wdb      | select * from pg_stat_activity ;                      | f       | 2009-02-18 12:21:47.925813+00 | 2009-02-18 12:21:47.925813+00 | 2009-02-18 11:09:31.201507+00 |             |          -1
 16385 | wdb     |   18834 |    16384 | wdb      | <IDLE> in transaction                                 | f       |                               | 2009-02-18 09:06:26.479633+00 | 2009-02-18 09:06:23.443698+00 |             |          -1
(3 rows)

Any leads as to how to figure out the problem, or alternatively how to work around it, would be much appreciated. I noticed that the disk on the test machine seemed to be running low on disk space, caused by an error in the fsm settings, but this was not the case on the production systems, so I assume that this is not the problem here either (I will fix it for future test runs, though).

Regards,
Michael Akinde
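For context, the loading pattern described above boils down to creating tens of thousands of large objects inside a transaction. The following is a minimal sketch of that pattern using libpq's large-object API; the connection string, dummy payload, and loop count are placeholders for illustration, and this is not the actual wdb loader code.

/*
 * Minimal sketch (not the actual wdb loader): store a batch of blobs as
 * PostgreSQL large objects through libpq's large-object API.  The DSN,
 * payload and loop count are invented placeholders.
 */
#include <libpq-fe.h>
#include <libpq/libpq-fs.h>   /* INV_READ, INV_WRITE */
#include <cstdio>
#include <cstdlib>
#include <vector>

int main()
{
    PGconn *conn = PQconnectdb("dbname=wdb");           /* assumed DSN */
    if (PQstatus(conn) != CONNECTION_OK) {
        std::fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return EXIT_FAILURE;
    }

    std::vector<char> blob(1024 * 1024, 'x');            /* dummy payload */

    /* Large-object operations must run inside a transaction. */
    PQclear(PQexec(conn, "BEGIN"));

    for (int i = 0; i < 1000; ++i) {
        Oid oid = lo_creat(conn, INV_READ | INV_WRITE);   /* new large object */
        int fd  = lo_open(conn, oid, INV_WRITE);
        if (fd < 0 ||
            lo_write(conn, fd, blob.data(), blob.size()) < 0 ||
            lo_close(conn, fd) < 0) {
            std::fprintf(stderr, "large-object error: %s", PQerrorMessage(conn));
            PQclear(PQexec(conn, "ROLLBACK"));
            PQfinish(conn);
            return EXIT_FAILURE;
        }
        /* A real loader would also record `oid` in a reference table here. */
    }

    PQclear(PQexec(conn, "COMMIT"));
    PQfinish(conn);
    return EXIT_SUCCESS;
}

Since every large object is ultimately stored as rows in pg_catalog.pg_largeobject, a bulk load of this kind is also what triggers the "autovacuum: VACUUM ANALYZE pg_catalog.pg_largeobject" worker visible in the pg_stat_activity output above.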
Michael Akinde <michael.akinde@met.no> writes:
> We have been running into a strange problem on our production servers of
> a system we are about to roll out with the loading of large objects.

Hmm, can you attach to the stuck backend and the vacuum worker process with gdb and get stack traces from them? The pg_locks view does not indicate any locking problem, but I'm wondering if there could be a deadlock at the LWLock level.

			regards, tom lane
Hi,
I killed the test about an hour ago to eliminate the free space stuff as an issue. I've started the process up again, so if (when) the test hangs again, I'll try to get the stack trace.
Regards,
Michael Akinde
Tom Lane wrote:
> Hmm, can you attach to the stuck backend and the vacuum worker process
> with gdb and get stack traces from them? The pg_locks view does not
> indicate any locking problem, but I'm wondering if there could be a
> deadlock at the LWLock level.

My reply seems to have been lost in the ether. Anyway, I fixed the low fsm settings and have managed to replicate the problem in two separate instances. The problem does not appear to be autovacuum, as I was able to observe the process hanging long after autovacuum had finished. Perhaps the vacuuming tasks were getting stuck earlier because of the too-low fsm setting?

Anyway - the situation now is that just the loading process is hanging on the server, with an <IDLE> in transaction. But it is definitely the loading program that is hanging, not the Postgres server.

pg_stat_activity shows:

 2701646 | wdb | 26359 | 2701645 | wdb | <IDLE> in transaction | f | | 2009-02-18 23:57:59.619868+00 | 2009-02-18 23:57:58.461848+00 | | -1

Backtrace from the postgres process:

#0  0x00002ad9ed3fef15 in recv () from /lib/libc.so.6
#1  0x000000000053ba38 in secure_read ()
#2  0x0000000000542700 in pq_comm_reset ()
#3  0x0000000000542b47 in pq_getbyte ()
#4  0x00000000005b648d in prepare_for_client_read ()
#5  0x00000000005b6d7a in PostgresMain ()
#6  0x000000000058c34b in ClosePostmasterPorts ()
#7  0x000000000058d06e in PostmasterMain ()
#8  0x00000000005444f5 in main ()

Backtrace from gribLoad:

#0  0x00002b2ab43c2c8f in poll () from /lib/libc.so.6
#1  0x00002b2ab47cc4af in PQmblen () from /usr/lib/libpq.so.4
#2  0x00002b2ab47cc590 in pqWaitTimed () from /usr/lib/libpq.so.4
#3  0x00002b2ab47cbe72 in PQgetResult () from /usr/lib/libpq.so.4
#4  0x00002b2ab47cbf4e in PQgetResult () from /usr/lib/libpq.so.4
#5  0x00002b2ab32a0556 in pqxx::connection_base::prepared_exec () from /usr/lib/libpqxx-2.6.8.so
#6  0x00002b2ab32be6ed in pqxx::transaction_base::prepared_exec () from /usr/lib/libpqxx-2.6.8.so
#7  0x00002b2ab32b2486 in pqxx::prepare::invocation::exec () from /usr/lib/libpqxx-2.6.8.so
#8  0x00002b2ab2d9b4cc in wdb::database::WriteValue::operator() () from /usr/lib/libwdbLoaderBase.so.0
#9  0x00002b2ab2da27d8 in pqxx::connection_base::perform<wdb::database::WriteValue> () from /usr/lib/libwdbLoaderBase.so.0
#10 0x00002b2ab2d99ddb in wdb::database::LoaderDatabaseConnection::loadField () from /usr/lib/libwdbLoaderBase.so.0
#11 0x00000000004182f0 in log4cpp::CategoryStream::operator<< <char [13]> ()
#12 0x00000000004073e8 in ?? ()
#13 0x000000000040819f in ?? ()
#14 0x00002b2ab431e4ca in __libc_start_main () from /lib/libc.so.6
#15 0x000000000040665a in ?? ()
#16 0x00007ffff7e3d6c8 in ?? ()
#17 0x0000000000000000 in ?? ()

Whatever weirdness happens appears always to occur at this point in the process (previous stack traces we've taken point to the same insert statement), but the timing is seemingly totally random: it can occur right away, or the loading can run dozens of times before getting stuck. I am rather at a loss to explain this. We've loaded literally millions of rows with this code, so the functionality is hardly untested. Is it something we are doing, or could we have hit upon some concurrency issue in libpq or the pqxx transactors? Any hints or tips to help identify the problem would be appreciated.

Strangely, if one attaches strace to the loading process (not the postgres process), then the poll() call on which the process has been hanging for hours will release, and the process just carries on as if nothing had happened. Has anyone seen anything like this before?

Regards,
Michael A.
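The two backtraces are complementary: the backend is blocked in recv() waiting for the next client message, while gribLoad is blocked in poll() under PQgetResult() waiting for the server's reply, so each end believes it is the other's turn to speak. As a purely illustrative sketch (not something libpqxx 2.6.8 or the wdb loader actually does), a client built on libpq's asynchronous API could at least detect such a stall instead of waiting forever; the helper name, SQL handling and timeout value below are invented for the example.

/*
 * Illustrative sketch only: send a query with libpq's asynchronous API and
 * give up if the server has not produced a complete result within a
 * deadline, instead of blocking indefinitely inside PQgetResult().
 */
#include <libpq-fe.h>
#include <sys/select.h>
#include <cstdio>

/* Returns the first result, or NULL on connection error or timeout. */
PGresult *exec_with_timeout(PGconn *conn, const char *sql, int timeout_secs)
{
    if (!PQsendQuery(conn, sql))             /* dispatch without blocking */
        return nullptr;

    while (PQisBusy(conn)) {                 /* result not complete yet */
        fd_set readable;
        FD_ZERO(&readable);
        FD_SET(PQsocket(conn), &readable);

        timeval tv = { timeout_secs, 0 };
        int rc = select(PQsocket(conn) + 1, &readable, nullptr, nullptr, &tv);
        if (rc <= 0) {                       /* timeout or select() error */
            std::fprintf(stderr, "no reply from server within %d seconds\n",
                         timeout_secs);
            /* A real client would cancel the query (PQgetCancel/PQcancel)
             * or reset the connection before reusing it. */
            return nullptr;
        }
        if (!PQconsumeInput(conn))           /* read whatever has arrived */
            return nullptr;
    }
    return PQgetResult(conn);                /* will not block at this point */
}

A complete client would keep calling PQgetResult() until it returns NULL to finish the query; the sketch stops at the first result purely to keep the stall-detection loop visible.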
Michael Akinde <michael.akinde@met.no> writes:
> Anyway - the situation now is that just the loading process is hanging
> on the server, with an <IDLE> in transaction. But it is definitely the
> loading program that is hanging, not the Postgres server.

What the stack traces seem to show is that both the client and the server are waiting for each other to send some data. Which means somebody's bollixed the protocol.

In the past we've seen this type of thing caused by multithreaded client programs in which more than one thread tried to use the same PGconn object without adequate interlocking. libpq itself does not provide any threading guards --- if you want more than one thread accessing a PGconn then it's up to you to use a mutex or something to serialize them. Is it possible this case applies here?

			regards, tom lane
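For readers unfamiliar with the failure mode Tom describes: if two threads interleave calls on the same PGconn, the request/response stream gets out of step and both ends can end up waiting on each other. A minimal sketch of the kind of serialization he mentions might look like the following; the SerializedConn wrapper, the mutex, and the example query are assumptions made for this sketch, not part of libpq or of the wdb loader.

/*
 * Illustrative sketch of serializing access to a shared PGconn, since libpq
 * itself provides no thread guards.  The wrapper class is invented here.
 */
#include <libpq-fe.h>
#include <mutex>
#include <stdexcept>
#include <string>

class SerializedConn {
public:
    explicit SerializedConn(const std::string &dsn)
        : conn_(PQconnectdb(dsn.c_str()))
    {
        if (PQstatus(conn_) != CONNECTION_OK)
            throw std::runtime_error(PQerrorMessage(conn_));
    }
    ~SerializedConn() { PQfinish(conn_); }

    /* Only one thread at a time may talk to the backend. */
    void exec(const std::string &sql)
    {
        std::lock_guard<std::mutex> guard(mutex_);
        PGresult *res = PQexec(conn_, sql.c_str());
        const bool ok = PQresultStatus(res) == PGRES_COMMAND_OK ||
                        PQresultStatus(res) == PGRES_TUPLES_OK;
        PQclear(res);
        if (!ok)
            throw std::runtime_error(PQerrorMessage(conn_));
    }

private:
    std::mutex mutex_;   /* serializes all use of conn_ */
    PGconn    *conn_;
};

As the follow-up below notes, the wdb loader is single-threaded, so this particular cause does not seem to apply here.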
Tom Lane wrote:
> What the stack traces seem to show is that both the client and the
> server are waiting for each other to send some data. Which means
> somebody's bollixed the protocol. In the past we've seen this type of
> thing caused by multithreaded client programs in which more than one
> thread tried to use the same PGconn object without adequate
> interlocking. Is it possible this case applies here?

My apologies for the delayed response. Our application is single-threaded, so it seems unlikely that we are running into a problem with that.

The only thing I can think of right now is that we are running Postgres 8.3 on Debian Etch (so a backported Debian package), whereas the libraries linked into our application are older versions (libpq4 and libpqxx 2.6.8). I'll try to upgrade the OS to a version with "native" support for 8.3 and up-to-date (or at least more up-to-date) versions of libpq and libpqxx, and check whether the tests still break down.

Regards,
Michael A.
Michael Akinde <michael.akinde@met.no> writes:
> Tom Lane wrote:
>> In the past we've seen this type of thing caused by multithreaded
>> client programs in which more than one thread tried to use the same
>> PGconn object without adequate interlocking.

> Our application is single-threaded, so it seems unlikely that we are
> running into a problem with that.

Well, maybe you've found an actual bug then; but without a test case that other people can poke at, it's hard to investigate.

> The only thing I can think of right now, is that we are running a
> Postgres 8.3 on Debian Etch (so a backported debian package), whereas
> the libraries linked into our application are older library version
> (libpq4 and libpqxx 2.6.8).

Older libpq versions should work fine with an 8.3 server --- if they don't, that's a bug in itself. I do not know the status of libpqxx though; you might want to check whether there are known bugs in that version.

			regards, tom lane
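The "test case that other people can poke at" that Tom asks for would essentially be a small standalone program exercising the same insert path outside the wdb and libpqxx layers. One possible shape for such a repro, sketched with plain libpq prepared statements, is shown below; the table name, statement text, parameter, and iteration count are hypothetical placeholders, not the real loader's schema.

/*
 * Sketch of a minimal standalone test case: repeatedly execute a prepared
 * insert through plain libpq, bypassing libpqxx, to see whether the hang
 * reproduces outside the wdb loader.  Table and parameters are invented.
 */
#include <libpq-fe.h>
#include <cstdio>
#include <cstdlib>
#include <string>

int main()
{
    PGconn *conn = PQconnectdb("dbname=wdb");                /* assumed DSN */
    if (PQstatus(conn) != CONNECTION_OK) {
        std::fprintf(stderr, "%s", PQerrorMessage(conn));
        return EXIT_FAILURE;
    }

    /* One-time server-side prepare of the insert used in the loop below. */
    PGresult *res = PQprepare(conn, "ins",
                              "INSERT INTO loadtest(payload) VALUES ($1)",
                              1, NULL);
    if (PQresultStatus(res) != PGRES_COMMAND_OK) {
        std::fprintf(stderr, "prepare failed: %s", PQerrorMessage(conn));
        PQclear(res);
        PQfinish(conn);
        return EXIT_FAILURE;
    }
    PQclear(res);

    const std::string payload(100000, 'x');                  /* dummy value */
    const char *values[1] = { payload.c_str() };

    for (long i = 0; i < 80000; ++i) {                        /* ~80,000 fields */
        res = PQexecPrepared(conn, "ins", 1, values, NULL, NULL, 0);
        if (PQresultStatus(res) != PGRES_COMMAND_OK) {
            std::fprintf(stderr, "insert %ld failed: %s", i, PQerrorMessage(conn));
            PQclear(res);
            break;
        }
        PQclear(res);
    }

    PQfinish(conn);
    return EXIT_SUCCESS;
}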
Tom Lane wrote:
> Well, maybe you've found an actual bug then; but without a test case
> that other people can poke at, it's hard to investigate.

There are a few known bugs in pqxx that *might* relate to our code; for instance, we use prepared statements, and this is an area that has seen a good deal of work from 2.6.8 to 2.6.9.

> Older libpq versions should work fine with an 8.3 server --- if they
> don't, that's a bug in itself. I do not know the status of libpqxx
> though; you might want to check whether there are known bugs in that
> version.

The test work at the moment is hampered by our server section needing to physically move some of our development & test servers this week, but if upgrading pqxx doesn't seem to eliminate the problem, I'll do some work to see if I can recreate the problem in an isolated setting.

Regards,
Michael A.