Thread: Large object loading stalls
Hi,

We have been running into a strange problem with the loading of large objects on the production servers of a system we are about to roll out. At various times (usually separated by a few days), the loading process against the server will suddenly either take an extremely long time (going from the usual 15-20 minutes up to 45 minutes or more) or simply hang altogether (i.e., after 8+ hours have gone by, we haven't had the patience to wait and see if it eventually finishes). This is on an 8.3.3 server running on Debian Etch; most of the previous testing occurred on 8.1.10, where we did not encounter this problem (though we had other issues on that platform).

I have run some tests loading large objects into a PostgreSQL database (8.3.3) with a more rapid turnaround than in production (loading, deleting, vacuuming directly and then starting the loading over again) and have been able to provoke what seems to be the same issue there - usually within 12 hours.

Essentially, we have a C++ program (pqxx) loading a large number of large objects (somewhere around 80,000 or so) that exhibits strange behaviour. It varies quite a lot in loading time (from 1600 to 2500 seconds), and at random times it will simply hang (at least it seems so, given that it has not progressed at all after 7-8 hours). It displays this behaviour irrespective of whether we load all 80,000 fields in one go or break the loading up into smaller consecutive batches of 1,000 blobs at a time (though it does seem easier to provoke the problem with the former approach). It can run fine for hours at a time and then suddenly lock up (in the latest test I did, it loaded the data correctly 22 times in a row before locking up on the 23rd run-through).

The pattern seems to be identical each time; ps aux shows the process (wdb) idle in transaction, with no activity on the system being apparent:

postgres 19965  0.0  0.0  72628  5584 ?  S   Feb17  0:01 /usr/lib/postgresql/8.3/bin/postgres -D /var/lib/postgresql/8.3/main -c config_file=/etc/postgresql/8.3/main/postgresql.conf
postgres 19967  0.4  0.0  72748 27636 ?  Ss  Feb17  6:38 postgres: writer process
postgres 19968  0.0  0.0  72628  1784 ?  Ss  Feb17  0:09 postgres: wal writer process
postgres 19969  0.0  0.0  72940  1924 ?  Ss  Feb17  0:00 postgres: autovacuum launcher process
postgres 19970  0.0  0.0  42352  1736 ?  Ss  Feb17  0:02 postgres: stats collector process
postgres 16617  0.0  0.0  95156 10948 ?  Ss  08:42  0:00 postgres: autovacuum worker process   wdb
postgres 18834  0.0  0.1 118632 37320 ?  Ss  09:06  0:01 postgres: wdb wdb [local] idle in transaction

A core dump we took of the program earlier seemed to indicate that it was waiting on the database to return from the insert statement. top indicates that both postgres and the application are idling (no load on the server whatsoever). pg_locks and pg_stat_activity seem to indicate that there may perhaps be some issue with autovacuum?
pg_locks:

  locktype  | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction |  pid  |           mode           | granted
------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-------+--------------------------+---------
 relation   |    16385 |     2683 |      |       |            |               |         |       |          | 2/1122117          | 16617 | RowExclusiveLock         | t
 relation   |    16385 |    10969 |      |       |            |               |         |       |          | 3/1105             | 27895 | AccessShareLock          | t
 relation   |    16385 |     2613 |      |       |            |               |         |       |          | 2/1122117          | 16617 | ShareUpdateExclusiveLock | t
 virtualxid |          |          |      |       | 3/1105     |               |         |       |          | 3/1105             | 27895 | ExclusiveLock            | t
 virtualxid |          |          |      |       | 2/1122117  |               |         |       |          | 2/1122117          | 16617 | ExclusiveLock            | t
(5 rows)

pg_stat_activity:

 datid | datname | procpid | usesysid | usename  |                     current_query                     | waiting |          xact_start           |          query_start          |         backend_start         | client_addr | client_port
-------+---------+---------+----------+----------+-------------------------------------------------------+---------+-------------------------------+-------------------------------+-------------------------------+-------------+-------------
 16385 | wdb     |   16617 |       10 | postgres | autovacuum: VACUUM ANALYZE pg_catalog.pg_largeobject  | f       | 2009-02-18 08:42:45.066922+00 | 2009-02-18 08:42:45.066922+00 | 2009-02-18 08:42:45.065857+00 |             |
 16385 | wdb     |   27895 |    16384 | wdb      | select * from pg_stat_activity ;                      | f       | 2009-02-18 12:21:47.925813+00 | 2009-02-18 12:21:47.925813+00 | 2009-02-18 11:09:31.201507+00 |             |          -1
 16385 | wdb     |   18834 |    16384 | wdb      | <IDLE> in transaction                                 | f       |                               | 2009-02-18 09:06:26.479633+00 | 2009-02-18 09:06:23.443698+00 |             |          -1
(3 rows)

Any leads as to how to figure out the problem, or alternatively how to work around it, would be much appreciated. I noticed that the disk on the test machine seemed to be running low on disk space, caused by an error in the fsm settings, but this was not the case on the production systems, so I assume that this is not the problem here either (I will fix it for future test runs, though).

Regards,
Michael Akinde
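For context, the loading pattern described above boils down to creating tens of thousands of large objects inside a transaction. The following is a minimal sketch of that pattern using libpq's large-object API; the connection string, dummy payload, and loop count are placeholders for illustration, and this is not the actual wdb loader code.

/*
 * Minimal sketch (not the actual wdb loader): store a batch of blobs as
 * PostgreSQL large objects through libpq's large-object API.  The DSN,
 * payload and loop count are invented placeholders.
 */
#include <libpq-fe.h>
#include <libpq/libpq-fs.h>   /* INV_READ, INV_WRITE */
#include <cstdio>
#include <cstdlib>
#include <vector>

int main()
{
    PGconn *conn = PQconnectdb("dbname=wdb");           /* assumed DSN */
    if (PQstatus(conn) != CONNECTION_OK) {
        std::fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return EXIT_FAILURE;
    }

    std::vector<char> blob(1024 * 1024, 'x');            /* dummy payload */

    /* Large-object operations must run inside a transaction. */
    PQclear(PQexec(conn, "BEGIN"));

    for (int i = 0; i < 1000; ++i) {
        Oid oid = lo_creat(conn, INV_READ | INV_WRITE);   /* new large object */
        int fd  = lo_open(conn, oid, INV_WRITE);
        if (fd < 0 ||
            lo_write(conn, fd, blob.data(), blob.size()) < 0 ||
            lo_close(conn, fd) < 0) {
            std::fprintf(stderr, "large-object error: %s", PQerrorMessage(conn));
            PQclear(PQexec(conn, "ROLLBACK"));
            PQfinish(conn);
            return EXIT_FAILURE;
        }
        /* A real loader would also record `oid` in a reference table here. */
    }

    PQclear(PQexec(conn, "COMMIT"));
    PQfinish(conn);
    return EXIT_SUCCESS;
}

Since every large object is ultimately stored as rows in pg_catalog.pg_largeobject, a bulk load of this kind is also what triggers the "autovacuum: VACUUM ANALYZE pg_catalog.pg_largeobject" worker visible in the pg_stat_activity output above.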
Michael Akinde <michael.akinde@met.no> writes:
> We have been running into a strange problem on our production servers of
> a system we are about to roll out with the loading of large objects.

Hmm, can you attach to the stuck backend and the vacuum worker process with gdb and get stack traces from them? The pg_locks view does not indicate any locking problem, but I'm wondering if there could be a deadlock at the LWLock level.

			regards, tom lane
Hi,
I killed the test about an hour ago to eliminate the free space stuff as an issue. I've started the process up again, so if (when) the test hangs again, I'll try to get the stack trace.
Regards,
Michael Akinde
Tom Lane wrote:
> Hmm, can you attach to the stuck backend and the vacuum worker process
> with gdb and get stack traces from them? The pg_locks view does not
> indicate any locking problem, but I'm wondering if there could be a
> deadlock at the LWLock level.

My reply seems to have been lost in the ether. Anyway, I fixed the low fsm settings and have managed to replicate the problem in two separate instances. The problem does not appear to be autovacuum, as I was able to observe the process hanging long after autovacuum had finished. Perhaps the vacuuming tasks were getting stuck earlier because of the too-low fsm setting?

Anyway - the situation now is that just the loading process is hanging on the server, with an <IDLE> in transaction. But it is definitely the loading program that is hanging, not the Postgres server.

pg_stat_activity shows:

 2701646 | wdb | 26359 | 2701645 | wdb | <IDLE> in transaction | f | | 2009-02-18 23:57:59.619868+00 | 2009-02-18 23:57:58.461848+00 | | -1

Backtrace from the postgres process:

#0  0x00002ad9ed3fef15 in recv () from /lib/libc.so.6
#1  0x000000000053ba38 in secure_read ()
#2  0x0000000000542700 in pq_comm_reset ()
#3  0x0000000000542b47 in pq_getbyte ()
#4  0x00000000005b648d in prepare_for_client_read ()
#5  0x00000000005b6d7a in PostgresMain ()
#6  0x000000000058c34b in ClosePostmasterPorts ()
#7  0x000000000058d06e in PostmasterMain ()
#8  0x00000000005444f5 in main ()

Backtrace from gribLoad:

#0  0x00002b2ab43c2c8f in poll () from /lib/libc.so.6
#1  0x00002b2ab47cc4af in PQmblen () from /usr/lib/libpq.so.4
#2  0x00002b2ab47cc590 in pqWaitTimed () from /usr/lib/libpq.so.4
#3  0x00002b2ab47cbe72 in PQgetResult () from /usr/lib/libpq.so.4
#4  0x00002b2ab47cbf4e in PQgetResult () from /usr/lib/libpq.so.4
#5  0x00002b2ab32a0556 in pqxx::connection_base::prepared_exec () from /usr/lib/libpqxx-2.6.8.so
#6  0x00002b2ab32be6ed in pqxx::transaction_base::prepared_exec () from /usr/lib/libpqxx-2.6.8.so
#7  0x00002b2ab32b2486 in pqxx::prepare::invocation::exec () from /usr/lib/libpqxx-2.6.8.so
#8  0x00002b2ab2d9b4cc in wdb::database::WriteValue::operator() () from /usr/lib/libwdbLoaderBase.so.0
#9  0x00002b2ab2da27d8 in pqxx::connection_base::perform<wdb::database::WriteValue> () from /usr/lib/libwdbLoaderBase.so.0
#10 0x00002b2ab2d99ddb in wdb::database::LoaderDatabaseConnection::loadField () from /usr/lib/libwdbLoaderBase.so.0
#11 0x00000000004182f0 in log4cpp::CategoryStream::operator<< <char [13]> ()
#12 0x00000000004073e8 in ?? ()
#13 0x000000000040819f in ?? ()
#14 0x00002b2ab431e4ca in __libc_start_main () from /lib/libc.so.6
#15 0x000000000040665a in ?? ()
#16 0x00007ffff7e3d6c8 in ?? ()
#17 0x0000000000000000 in ?? ()

Whatever weirdness happens appears always to occur at this point in the process (previous stack traces we've taken point to the same insert statement), but the timing is seemingly totally random: it can occur right away, or the loading can run dozens of times before getting stuck. I am rather at a loss to explain this. We've loaded literally millions of rows with this code, so the functionality is hardly untested. Is it something we are doing, or could we have hit upon some concurrency issue in libpq or the pqxx transactors? Any hints or tips to help identify the problem would be appreciated.

Strangely, if one attaches strace to the loading process (not the postgres process), then the poll() call on which the process has been hanging for hours will release, and the process just carries on as if nothing had happened. Has anyone seen anything like this before?

Regards,
Michael A.
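The two backtraces are complementary: the backend is blocked in recv() waiting for the next client message, while gribLoad is blocked in poll() under PQgetResult() waiting for the server's reply, so each end believes it is the other's turn to speak. As a purely illustrative sketch (not something libpqxx 2.6.8 or the wdb loader actually does), a client built on libpq's asynchronous API could at least detect such a stall instead of waiting forever; the helper name, SQL handling and timeout value below are invented for the example.

/*
 * Illustrative sketch only: send a query with libpq's asynchronous API and
 * give up if the server has not produced a complete result within a
 * deadline, instead of blocking indefinitely inside PQgetResult().
 */
#include <libpq-fe.h>
#include <sys/select.h>
#include <cstdio>

/* Returns the first result, or NULL on connection error or timeout. */
PGresult *exec_with_timeout(PGconn *conn, const char *sql, int timeout_secs)
{
    if (!PQsendQuery(conn, sql))             /* dispatch without blocking */
        return nullptr;

    while (PQisBusy(conn)) {                 /* result not complete yet */
        fd_set readable;
        FD_ZERO(&readable);
        FD_SET(PQsocket(conn), &readable);

        timeval tv = { timeout_secs, 0 };
        int rc = select(PQsocket(conn) + 1, &readable, nullptr, nullptr, &tv);
        if (rc <= 0) {                       /* timeout or select() error */
            std::fprintf(stderr, "no reply from server within %d seconds\n",
                         timeout_secs);
            /* A real client would cancel the query (PQgetCancel/PQcancel)
             * or reset the connection before reusing it. */
            return nullptr;
        }
        if (!PQconsumeInput(conn))           /* read whatever has arrived */
            return nullptr;
    }
    return PQgetResult(conn);                /* will not block at this point */
}

A complete client would keep calling PQgetResult() until it returns NULL to finish the query; the sketch stops at the first result purely to keep the stall-detection loop visible.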
Michael Akinde <michael.akinde@met.no> writes:
> Anyway - the situation now is that just the loading process is hanging
> on the server, with an <IDLE> in transaction. But it is definitely the
> loading program that is hanging, not the Postgres server.

What the stack traces seem to show is that both the client and the server are waiting for each other to send some data. Which means somebody's bollixed the protocol.

In the past we've seen this type of thing caused by multithreaded client programs in which more than one thread tried to use the same PGconn object without adequate interlocking. libpq itself does not provide any threading guards --- if you want more than one thread accessing a PGconn then it's up to you to use a mutex or something to serialize them. Is it possible this case applies here?

			regards, tom lane
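For readers unfamiliar with the failure mode Tom describes: if two threads interleave calls on the same PGconn, the request/response stream gets out of step and both ends can end up waiting on each other. A minimal sketch of the kind of serialization he mentions might look like the following; the SerializedConn wrapper, the mutex, and the example query are assumptions made for this sketch, not part of libpq or of the wdb loader.

/*
 * Illustrative sketch of serializing access to a shared PGconn, since libpq
 * itself provides no thread guards.  The wrapper class is invented here.
 */
#include <libpq-fe.h>
#include <mutex>
#include <stdexcept>
#include <string>

class SerializedConn {
public:
    explicit SerializedConn(const std::string &dsn)
        : conn_(PQconnectdb(dsn.c_str()))
    {
        if (PQstatus(conn_) != CONNECTION_OK)
            throw std::runtime_error(PQerrorMessage(conn_));
    }
    ~SerializedConn() { PQfinish(conn_); }

    /* Only one thread at a time may talk to the backend. */
    void exec(const std::string &sql)
    {
        std::lock_guard<std::mutex> guard(mutex_);
        PGresult *res = PQexec(conn_, sql.c_str());
        const bool ok = PQresultStatus(res) == PGRES_COMMAND_OK ||
                        PQresultStatus(res) == PGRES_TUPLES_OK;
        PQclear(res);
        if (!ok)
            throw std::runtime_error(PQerrorMessage(conn_));
    }

private:
    std::mutex mutex_;   /* serializes all use of conn_ */
    PGconn    *conn_;
};

As the follow-up below notes, the wdb loader is single-threaded, so this particular cause does not seem to apply here.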
Tom Lane wrote:
> What the stack traces seem to show is that both the client and the
> server are waiting for each other to send some data. Which means
> somebody's bollixed the protocol. In the past we've seen this type of
> thing caused by multithreaded client programs in which more than one
> thread tried to use the same PGconn object without adequate
> interlocking. Is it possible this case applies here?

My apologies for the delayed response. Our application is single-threaded, so it seems unlikely that we are running into a problem with that.

The only thing I can think of right now is that we are running Postgres 8.3 on Debian Etch (so a backported Debian package), whereas the libraries linked into our application are older versions (libpq4 and libpqxx 2.6.8). I'll try to upgrade the OS to a version with "native" support for 8.3 and up-to-date (or at least more up-to-date) versions of libpq and libpqxx, and check whether the tests still break down.

Regards,
Michael A.
Michael Akinde <michael.akinde@met.no> writes:
> Tom Lane wrote:
>> In the past we've seen this type of thing caused by multithreaded
>> client programs in which more than one thread tried to use the same
>> PGconn object without adequate interlocking.

> Our application is single-threaded, so it seems unlikely that we are
> running into a problem with that.

Well, maybe you've found an actual bug then; but without a test case that other people can poke at, it's hard to investigate.

> The only thing I can think of right now, is that we are running a
> Postgres 8.3 on Debian Etch (so a backported debian package), whereas
> the libraries linked into our application are older library version
> (libpq4 and libpqxx 2.6.8).

Older libpq versions should work fine with an 8.3 server --- if they don't, that's a bug in itself. I do not know the status of libpqxx though; you might want to check whether there are known bugs in that version.

			regards, tom lane
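The "test case that other people can poke at" that Tom asks for would essentially be a small standalone program exercising the same insert path outside the wdb and libpqxx layers. One possible shape for such a repro, sketched with plain libpq prepared statements, is shown below; the table name, statement text, parameter, and iteration count are hypothetical placeholders, not the real loader's schema.

/*
 * Sketch of a minimal standalone test case: repeatedly execute a prepared
 * insert through plain libpq, bypassing libpqxx, to see whether the hang
 * reproduces outside the wdb loader.  Table and parameters are invented.
 */
#include <libpq-fe.h>
#include <cstdio>
#include <cstdlib>
#include <string>

int main()
{
    PGconn *conn = PQconnectdb("dbname=wdb");                /* assumed DSN */
    if (PQstatus(conn) != CONNECTION_OK) {
        std::fprintf(stderr, "%s", PQerrorMessage(conn));
        return EXIT_FAILURE;
    }

    /* One-time server-side prepare of the insert used in the loop below. */
    PGresult *res = PQprepare(conn, "ins",
                              "INSERT INTO loadtest(payload) VALUES ($1)",
                              1, NULL);
    if (PQresultStatus(res) != PGRES_COMMAND_OK) {
        std::fprintf(stderr, "prepare failed: %s", PQerrorMessage(conn));
        PQclear(res);
        PQfinish(conn);
        return EXIT_FAILURE;
    }
    PQclear(res);

    const std::string payload(100000, 'x');                  /* dummy value */
    const char *values[1] = { payload.c_str() };

    for (long i = 0; i < 80000; ++i) {                        /* ~80,000 fields */
        res = PQexecPrepared(conn, "ins", 1, values, NULL, NULL, 0);
        if (PQresultStatus(res) != PGRES_COMMAND_OK) {
            std::fprintf(stderr, "insert %ld failed: %s", i, PQerrorMessage(conn));
            PQclear(res);
            break;
        }
        PQclear(res);
    }

    PQfinish(conn);
    return EXIT_SUCCESS;
}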
Tom Lane wrote:
> Well, maybe you've found an actual bug then; but without a test case
> that other people can poke at, it's hard to investigate.

There are a few known bugs in pqxx that *might* relate to our code; for instance, we use prepared statements, and this is an area that has seen a good deal of work from 2.6.8 to 2.6.9.

> Older libpq versions should work fine with an 8.3 server --- if they
> don't, that's a bug in itself. I do not know the status of libpqxx
> though; you might want to check whether there are known bugs in that
> version.

The test work at the moment is hampered by our server section needing to physically move some of our development & test servers this week, but if upgrading pqxx doesn't seem to eliminate the problem, I'll do some work to see if I can recreate the problem in an isolated setting.

Regards,
Michael A.