BUG #12718: Apparent LWlock Problem Perhaps With Page Cleanup - Mailing list pgsql-bugs
From | charles.harwood@zuerchertech.com
---|---
Subject | BUG #12718: Apparent LWlock Problem Perhaps With Page Cleanup
Date |
Msg-id | 20150130221445.2583.8173@wrigleys.postgresql.org
Responses | Re: BUG #12718: Apparent LWlock Problem Perhaps With Page Cleanup
List | pgsql-bugs
The following bug has been logged on the website:

Bug reference:      12718
Logged by:          Charles R. Harwood
Email address:      charles.harwood@zuerchertech.com
PostgreSQL version: 9.3.5
Operating system:   Ubuntu Linux 12.04 Kernel 3.13.0
Description:

Many postgres worker processes stopped responding. They remained in state
'active' and not 'waiting' for hours. Each query was against the table that an
autovacuum process was also hung processing, and that autovacuum was the oldest
active transaction. Eventually I had to restart the cluster. The theory is that
somehow the VACUUM got the table into a state such that certain queries run
against it would hang trying to wrestle an LWLock away from it.

This shows a subset of pg_stat_activity many hours into the problem:

# SELECT NOW() - xact_start AS xact_age, query_start, pid, waiting, state,
         SUBSTRING(query, 1, 30)
  FROM pg_stat_activity
  WHERE state <> 'idle'
  ORDER BY NOW() - xact_start DESC;

    xact_age     |          query_start          |  pid  | waiting | state  |           substring
-----------------+-------------------------------+-------+---------+--------+--------------------------------
 20:04:20.469875 | 2015-01-29 17:39:21.520094-06 | 24718 | f       | active | autovacuum: VACUUM public.addr
 19:59:58.066609 | 2015-01-29 17:43:43.92336-06  | 24687 | f       | active | SELECT s0.id AS id, f0, f1, f1
 19:59:40.054033 | 2015-01-29 17:44:01.935936-06 | 21386 | f       | active | SELECT s0.id AS id, f0, f1, f1
 19:59:24.117727 | 2015-01-29 17:44:17.872242-06 | 20370 | f       | active | SELECT s0.id AS id, f0, f1, f1
 19:32:51.946155 | 2015-01-29 18:10:50.064425-06 |  2900 | f       | active | UPDATE addresses...
 19:11:38.715295 | 2015-01-29 18:32:03.274674-06 | 13545 | f       | active | SELECT s0.id AS id, f0, f1, f1
 18:53:00.421641 | 2015-01-29 18:50:41.568328-06 |  2367 | f       | active | SELECT '', '', street, city, s
 12:53:32.658331 | 2015-01-30 00:50:09.331638-06 | 22220 | f       | active | SELECT '', '', street, city, s
 11:46:31.322061 | 2015-01-30 01:57:10.667908-06 | 17054 | f       | active | SELECT addressesid FROM ((SEL
 11:36:33.11271  | 2015-01-30 02:07:08.877259-06 | 22378 | f       | active | SELECT addressesid FROM ((SEL
 07:43:06.642865 | 2015-01-30 06:00:37.139222-06 | 26815 | f       | active | SELECT s0.id AS id, f0, f1, f1
 04:03:57.508697 | 2015-01-30 09:39:44.481272-06 | 19351 | f       | active | SELECT addressesid FROM ((SEL
 03:37:39.971883 | 2015-01-30 10:06:02.018086-06 | 10422 | f       | active | SELECT addressesid FROM ((SEL
 02:39:14.576335 | 2015-01-30 11:04:27.413634-06 | 21364 | f       | active | SELECT addressesid FROM ((SEL
 02:24:00.125199 | 2015-01-30 11:19:41.86477-06  | 14600 | f       | active | SELECT addressesid FROM ((SEL
 02:03:32.915133 | 2015-01-30 11:40:09.074836-06 | 22547 | f       | active | SELECT addressesid FROM ((SEL
 01:53:49.639909 | 2015-01-30 11:49:52.35006-06  | 31567 | f       | active | SELECT s0.id AS id, f0, f1, f1
 01:53:44.700417 | 2015-01-30 11:49:57.289552-06 | 31571 | f       | active | SELECT s0.id AS id, f0, f1, f1
 01:53:39.645157 | 2015-01-30 11:50:02.344812-06 | 31623 | f       | active | SELECT s0.id AS id, f0, f1, f1
 01:53:20.892623 | 2015-01-30 11:50:21.097346-06 | 31887 | f       | active | SELECT s0.id AS id, f0, f1, f1
 01:08:30.519618 | 2015-01-30 12:35:11.470351-06 | 21167 | f       | active | SELECT namesid, name, type, in
 01:06:58.586233 | 2015-01-30 12:36:43.403736-06 | 24513 | f       | active | SELECT data.f5 AS y_id, data.f
 01:06:18.159009 | 2015-01-30 12:37:23.83096-06  | 21047 | f       | active | SELECT namesid, name, type, in
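As a cross-check on the claim that every stuck backend was on the same table as
the autovacuum worker, a query along these lines can be run (a sketch only; it
was not part of the diagnostics above, and pg_locks shows only heavyweight
locks, so the suspected LWLock wait itself would not appear here):

SELECT l.pid, c.relname, l.mode, l.granted, a.state, a.waiting
  FROM pg_locks l
  JOIN pg_class c ON c.oid = l.relation
  LEFT JOIN pg_stat_activity a ON a.pid = l.pid
 WHERE c.relname = 'addresses'
 ORDER BY l.pid;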
Here is what those processes were up to. I have full backtraces, which I will
attach when this gets to a bug tracker.

# ps -o pid,s,cmd,wchan=WIDE-WCHAN-COLUMN -p 2367,2900,10422,13545,14600,17054,19351,20370,21047,21167,21364,21386,22220,22378,22547,24513,24687,24718,26815,31567,31571,31623,31887
  PID S CMD                         WIDE-WCHAN-COLUMN
 2367 S postgres: leds leds 164.154 SYSC_semtimedop
 2900 S postgres: leds leds [local] SYSC_semtimedop
10422 S postgres: leds leds 164.154 SYSC_semtimedop
13545 S postgres: leds leds 164.154 SYSC_semtimedop
14600 S postgres: leds leds 164.154 SYSC_semtimedop
17054 S postgres: leds leds 164.154 SYSC_semtimedop
19351 S postgres: leds leds 164.154 SYSC_semtimedop
20370 S postgres: leds leds 164.154 SYSC_semtimedop
21047 S postgres: leds leds 164.154 SYSC_semtimedop
21167 S postgres: leds leds 164.154 SYSC_semtimedop
21364 S postgres: leds leds 164.154 SYSC_semtimedop
21386 S postgres: leds leds 164.154 SYSC_semtimedop
22220 S postgres: leds leds 164.154 SYSC_semtimedop
22378 S postgres: leds leds 164.154 SYSC_semtimedop
22547 S postgres: leds leds 164.154 SYSC_semtimedop
24513 R postgres: leds leds 164.154 -
24687 S postgres: leds leds 164.154 SYSC_semtimedop
24718 S postgres: autovacuum worker poll_schedule_timeout
26815 S postgres: leds leds 164.154 SYSC_semtimedop
31567 S postgres: leds leds 164.154 SYSC_semtimedop
31571 S postgres: leds leds 164.154 SYSC_semtimedop
31623 S postgres: leds leds 164.154 SYSC_semtimedop
31887 S postgres: leds leds 164.154 SYSC_semtimedop

Not sure if it will help in finding out how this happened, but it has happened
twice, both times on relatively large databases (tens to hundreds of gigabytes)
with several hundred concurrent connections. Both times it happened while
running a long-running script that updates each record in the 'addresses'
table. The table contains 1-5 million records, each ~1 KB wide; it references
several other tables and is itself referenced by a dozen tables. The script
starts a transaction, runs 20 single-record UPDATEs in quick succession,
COMMITs, and then briefly sleeps (see the sketch appended at the end of this
message). So if nothing else, the txid is going up fairly rapidly and the
database is working reasonably hard.

Hardware does not appear to be a factor, but both servers on which this
occurred are Dell rackmount servers with two modern Intel chips (~E5-2630s)
and 32-128 GB of memory.

Again, stack traces will be attached. Thank you.

-Charles
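P.S. For concreteness, one batch issued by the update script looks roughly like
the following. This is an illustrative sketch only: the id values and the SET
clause are placeholders, and the real script's SET list and WHERE key are not
shown in this report.

BEGIN;
UPDATE addresses SET street = street WHERE id = 1001;
UPDATE addresses SET street = street WHERE id = 1002;
-- 17 more single-row UPDATEs for the remaining ids in the batch
UPDATE addresses SET street = street WHERE id = 1020;
COMMIT;
-- the script then sleeps briefly before starting the next batch of 20 ids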