postgres bug report - Mailing list pgsql-bugs

From:      jfaith
Subject:   postgres bug report
Msg-id:    39782BE3.3FC29D2F@cemsys.com
Responses: Re: postgres bug report
List:      pgsql-bugs
============================================================================
                        POSTGRESQL BUG REPORT TEMPLATE
============================================================================

Your name          : Jeremy Faith
Your email address : jfaith@cemsys.com

System Configuration
---------------------
  Architecture (example: Intel Pentium)        : Intel Pentium
  Operating System (example: Linux 2.0.26 ELF) : SCO OpenServer 5.0.5
  PostgreSQL version (example: PostgreSQL-7.0) : PostgreSQL-7.0
  Compiler used (example: gcc 2.8.0)           : egcs-2.91.66 (egcs-1.1.2 release)

Please enter a FULL description of your problem:
------------------------------------------------

A daily 'vacuum verbose' cron job failed; the end of the log follows:

. . .
NOTICE:  --Relation allcards--
NOTICE:  Pages 1517: Changed 5, reaped 231, Empty 0, New 0; Tup 192248: Vac 396, Keep/VTL 0/0, Crash 0, UnUsed 0, MinLen 60, MaxLen 60; Re-using: Free/Avail. Space 37656/35140; EndEmpty/Avail. Pages 0/230. CPU 1.39s/1.31u sec.
NOTICE:  Index alc1: Pages 1063; Tuples 192248: Deleted 396. CPU 0.40s/1.78u sec.
NOTICE:  Rel allcards: Pages: 1517 --> 1514; Tuple(s) moved: 280. CPU 0.45s/0.59u sec.
NOTICE:  Index alc1: Pages 1063; Tuples 192248: Deleted 280. CPU 0.31s/1.42u sec.
NOTICE:  --Relation cards--
NOTICE:  Pages 1307: Changed 188, reaped 1245, Empty 0, New 0; Tup 104774: Vac 16484, Keep/VTL 0/0, Crash 0, UnUsed 85, MinLen 72, MaxLen 88; Re-using: Free/Avail. Space 1496136/1491468; EndEmpty/Avail. Pages 0/1244. CPU 0.69s/1.73u sec.
NOTICE:  Index crd4: Pages 518; Tuples 104774: Deleted 16484. CPU 0.23s/1.57u sec.
NOTICE:  Index crd3: Pages 381; Tuples 104774: Deleted 16484. CPU 0.28s/1.72u sec.
NOTICE:  Index crd2: Pages 785; Tuples 104774: Deleted 16484. CPU 0.57s/1.85u sec.
NOTICE:  Index crd1: Pages 533; Tuples 104774: Deleted 16484. CPU 0.47s/1.72u sec.
NOTICE:  Rel cards: Pages: 1307 --> 1121; Tuple(s) moved: 6159. CPU 26.80s/9.18u sec.
NOTICE:  Index crd4: Pages 518; Tuples 104774: Deleted 6159. CPU 0.34s/0.95u sec.
NOTICE:  Index crd3: Pages 381; Tuples 104774: Deleted 6159. CPU 0.17s/0.79u sec.
NOTICE:  Index crd2: Pages 785; Tuples 104774: Deleted 6159. CPU 0.50s/0.98u sec.
NOTICE:  Index crd1: Pages 533; Tuples 104774: Deleted 6159. CPU 0.43s/0.81u sec.
pqReadData() -- backend closed the channel unexpectedly.
        This probably means the backend terminated abnormally
        before or while processing the request.
connection to server was lost

Two programs (using libpq) access this DB:

prtcwri  - receives audits from a master DB and produces SQL statements
           (inserts/updates/deletes) to mirror their effects in the
           Postgres DB.
prtc_ncn - spawns a set of child processes (in this case 8) to service
           requests from serial ports. These are all selects, except for
           one update on the cards table. On this machine about 18000 of
           these updates occur per day.

The prtc_ncn process died when the cron job ran; on restart it seemed to be fine. At the time the cron job runs, prtcwri has no activity, and it did not die; when audits came through later, its log file indicated they were performed correctly. However, when I ran psql, selects on cards did not show the new data!

This machine is in production use, so I am sorry but I was unable to perform any more analysis. I reinitialised the DB (dropdb, createdb, etc.) from the master (a separate machine running Informix-SE) and restarted my processes. I have disabled the cron job, and the machine has been fine for the last few days. Two other machines with identical software at the same site are operating normally, but they do not perform as many updates.

If it will be helpful, I can simulate most of these circumstances on a test machine here in an attempt to recreate the failure. What debugging options should I use?

On a related theme, but not a bug: I've just looked at the code for vacuum. If I understand correctly, it locks each table in turn exclusively. This can take some time for each table (in my case, about 5 minutes for cards).
For my purposes I need the cards table to be accessible 24x7; an exclusive lock for 1 or 2 seconds would probably be OK, but certainly not minutes. As a workaround (kludge), instead of doing a vacuum I am currently experimenting with the following:

  create table cards_new...
  insert into cards_new select * from cards;
  create indexes on cards_new...  (using an incrementing extension number)
  alter table cards rename to cards_old;
  alter table cards_new rename to cards;

So the cards table is only unavailable for the brief period while the two rename statements run. I then replicate missed updates from cards_old into cards (there is a last_time field which makes this easy enough). When this script is run from prtcwri, no other insert/delete/update statements run on the DB. It has the desired effect of keeping performance up, and it takes less time than a vacuum of cards!

I don't like this solution, but I need something to keep the performance of the DB up. Most queries run very rapidly, 10-100ms for a sequence of at least 4 selects, but after a few days without a vacuum it gets more like 500ms on average. I need this to be at worst 2s.

Q) Is this solution OK in principle, or is there a fundamental flaw?

Both programs connect to the DB at start-up (the parent prtc_ncn disconnects and forks each child process, which then opens its own connection) and only close the connection when killed.

What I would most like to see is a non-exclusive-locking version of vacuum, maybe a 'vacuum lite', which might not be able to reclaim space but could keep performance up. I realise this is a complicated procedure at the very heart of the system, but I would be happy to work on this if it seems feasible and would be acceptable to others. I also know that a non-overwriting storage manager may be available in the future, along with WAL, but I'm looking for something soon.
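Spelled out as runnable SQL, the rename-swap workaround looks roughly like this. This is only a sketch: the card_id column, the versioned index name crd1_v2, and the cutoff timestamp are illustrative stand-ins; only the cards table and the last_time field come from the description above.

```sql
-- Copy the live table; selects/updates on cards continue meanwhile.
CREATE TABLE cards_new AS SELECT * FROM cards;

-- Rebuild each index under a fresh name (index names must be unique,
-- hence the incrementing extension number). Repeat per index.
CREATE INDEX crd1_v2 ON cards_new (card_id);  -- card_id is illustrative

-- The only exclusive window: the two renames.
ALTER TABLE cards RENAME TO cards_old;
ALTER TABLE cards_new RENAME TO cards;

-- Catch up rows modified in cards_old during the copy, using last_time.
-- '2000-07-21 02:00' stands in for the time the copy started.
DELETE FROM cards WHERE card_id IN
    (SELECT card_id FROM cards_old WHERE last_time > '2000-07-21 02:00');
INSERT INTO cards
    SELECT * FROM cards_old WHERE last_time > '2000-07-21 02:00';

DROP TABLE cards_old;
```

Note the exposure in this sketch: any update that commits between the renames and the catch-up step must be caught by the last_time comparison, so the cutoff should err on the early side.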
A simple addition that might be handy would be the inclusion of the actual time taken to perform each vacuum operation (I know other processes etc. will cause this to vary heavily), as well as the already-included rusage values. This would at least give an indication of how long a table is locked for.

Please describe a way to repeat the problem.  Please try to provide a
concise reproducible example, if at all possible:
----------------------------------------------------------------------

If you know how this problem might be fixed, list the solution below:
---------------------------------------------------------------------