BUG #18005: PSQL Process hangs in parallel mode - Mailing list pgsql-bugs
From | PG Bug reporting form |
---|---|
Subject | BUG #18005: PSQL Process hangs in parallel mode |
Date | |
Msg-id | 18005-5cd026ae9ac57c17@postgresql.org Whole thread Raw |
Responses |
Re: BUG #18005: PSQL Process hangs in parallel mode
|
List | pgsql-bugs |
The following bug has been logged on the website: Bug reference: 18005 Logged by: Serge Bellina Email address: ess.bee59@gmail.com PostgreSQL version: 14.8 Operating system: Ubuntu-2204-jammy-amd64-base Description: Dear postgresql team! here additional / new informations to BUG #17990 and BUG #17999. I am using postgresql/postgis to calculate (better "estimate") pseudo-osm tags to enhance the route-calculation in the "Brouter" for bikers . Each continent is calculated in a separate database (Europe, north-america, etc..) For the planet the "sequential" calculation runs whithout errors, but it needs 26 hours. Unfortunately the "long" SQL´s (joins) used are running as mono-process (they are only 1 or 2 % of the time running in 3 parallel processes). That for I tried to start 2 parallel psql-processes (Europe in psql-process 1 and the rest of the planet in psql-process 2) The problem: By the first 3 tests psql-process 2 "hanged" every time for hours / never ends !!!!! VERY STRANGE: more than 30 SQL#s are used to process each continent, but psql-process 2 every time hanged on the same SELECT (nearly 10 selects were processed bevore) VERY STRANGE: the time stamps of the hanging select for asia (2023-06-26 16:34:26) is identical with the start-time of the select itself (the time stamp of psql-process 1 for europe is at this time 2023-06-26 17:22:51.41 !!!!) No message at all in the postgresql or Linux logs! Just the CPU load indicated that the process was still "working". Also after the successful termination of the psql-process 1, psql-process 2 hangs longer and longer .... / same state Kill -9 on the psql-process 2 in that state ( 1+ 2 processses) had no effect: just a full database restart (administrator) could stop the loop Remark 1: the second test was done by adding th config parameter "max_parallel_workers_per_gather = 0" (no parallel processes whithin a SQL!), but the "hanging" problem also occured Remark 2: after the DB restart, the previously hanging sql in psql-process 2 was restarted and ended successfully after 50 minutes. Many logs were collected (logs of the SQL´s, top, ps lsof, pg_activity_stat, memory usage, ..) and could be submitted by need (I do not know the proper way to do that) Here some extracts: My configuration: ==> PostgreSQL 14.8 (Ubuntu 14.8-0ubuntu0.22.04.1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0, 64-bit ==> postgres@Ubuntu-2204-jammy-amd64-base ------------------------------- 2023-06-26 16:34:26.163755+02 ==> START TIME of psql-process 2 in the last test (1 row) -- create tags for noise -- create raw data -- when several highways-segments are producing noise, aggregate the noises using the "ST_Union" of the segments! -- (better as using "sum" or "max" that do not deliver good factors) SELECT m.osm_id losmid, m.highway lhighway, q.highway as qhighway, q.maxspeed_class, case when q.highway in ('motorway', 'motorway_link','trunk','trunk_link') and q.maxspeed_class < 1.1 then st_area(st_intersection(m.way, ST_Union( q.way))) / st_area(m.way) when q.highway in ('motorway', 'motorway_link','trunk','trunk_link') then st_area(st_intersection(m.way, ST_Union( q.way))) / (1.5 * st_area(m.way)) when q.highway in ('primary','primary_link') and q.maxspeed_class < 2.1 then st_area(st_intersection(m.way, ST_Union( q.way))) / (2 * st_area(m.way)) when q.highway in ('primary','primary_link') then st_area(st_intersection(m.way, ST_Union( q.way))) / (3 * st_area(m.way)) when q.highway in ('secondary') and q.maxspeed_class < 2.1 then st_area(st_intersection(m.way, ST_Union( q.way))) / (3 * st_area(m.way)) when q.highway in ('secondary') then st_area(st_intersection(m.way, ST_Union( q.way))) / (5 * st_area(m.way)) end as noise_factor into table noise_tmp FROM osm_line_buf_50 AS m INNER JOIN osm_line_buf_50 AS q ON ST_Intersects(m.way, q.way) WHERE m.highway is not null and q.highway in ('motorway', 'motorway_link','trunk','trunk_link','primary','primary_link','secondary') GROUP BY losmid, lhighway, m.way, q.highway, q.maxspeed_class order by noise_factor desc; EXPLAIN ----------------- Sort (cost=337741901772.50..337742103465.40 rows=80677160 width=1020) Sort Key: ((st_area(st_intersection(m.way, st_union(q.way), '-1'::double precision)) / st_area(m.way))) DESC -> GroupAggregate (cost=302522177901.86..337695183065.46 rows=80677160 width=1020) Group Key: m.osm_id, m.highway, m.way -> Gather Merge (cost=302522177901.86..317079159018.69 rows=124988590885 width=1675) Workers Planned: 2 -> Sort (cost=302522176901.84..302652373350.68 rows=52078579535 width=1675) Sort Key: m.osm_id, m.highway, m.way -> Nested Loop (cost=0.42..217422937427.56 rows=52078579535 width=1675) -> Parallel Seq Scan on osm_poly_no_buf q (cost=0.00..223957561.03 rows=939141 width=663) Filter: (((landuse = ANY ('{forest,allotments,flowerbed,orchard,vineyard,recreation_ground,village_green}'::text[])) OR (leisure = ANY ('{garden,park,nature_reserve}'::text[]))) AND ((st_area((st_transform(way, 4326))::geography, true) / '1000000'::double precision) < '5000'::double precision)) -> Index Scan using osm_line_buf_50_idx on osm_line_buf_50 m (cost=0.42..231193.41 rows=8068 width=1012) Index Cond: (way && q.way) Filter: ((highway IS NOT NULL) AND st_intersects(way, q.way)) JIT: Functions: 12 Options: Inlining true, Optimization true, Expressions true, Deforming true (17 rows) --SELECT datname, pid, state, query, age(clock_timestamp(), query_start) AS age select * FROM pg_stat_activity WHERE state <> 'idle' AND query NOT LIKE '% FROM pg_stat_activity %'; datid | datname | pid | leader_pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | xact_start | query_start | state_change | wait_event_type | wait_event | state | backend_xid | backend_xmin | query_id | query | backend_type -----------+---------+---------+------------+----------+----------+------------------+-------------+-----------------+-------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-----------------+------------------+--------+-------------+--------------+----------+----------------------------------------------------------------------------------------------------------+----------------- 105680476 | asia | 1192433 | | 10 | postgres | psql | | | -1 | 2023-06-26 16:02:05.768052+02 | 2023-06-26 16:34:26.164375+02 | 2023-06-26 16:34:26.164375+02 | 2023-06-26 16:34:26.164375+02 | IO | DataFileRead | active | 289276 | 289208 | | SELECT +| client backend | | | | | | | | | | | | | | | | | | | | m.osm_id losmid, m.highway lhighway, q.highway as qhighway, q.maxspeed_class, +| | | | | | | | | | | | | | | | | | | | | case +| | | | | | | | | | | | | | | | | | | | | when q.highway in ('motorway', 'motorway_link','trunk','trunk_link') and q.maxspeed_class < 1.1 then+| .............. 103790736 | europe | 1183475 | | 10 | postgres | psql | | | -1 | 2023-06-26 15:21:15.376492+02 | 2023-06-26 17:22:51.420038+02 | 2023-06-26 17:22:51.420038+02 | 2023-06-26 17:22:51.420039+02 | | | active | 289286 | 289276 | | select xid , sum(water_river_see) as river_see +| client backend | | | | | | | | | | | | | | | | | | | | into table river_tmp +| | | | | | | | | | | | | | | | | | | | | from ( +| ...... 105680476 | asia | 1197807 | 1192433 | 10 | postgres | psql | | | | 2023-06-26 16:34:26.169606+02 | 2023-06-26 16:34:26.164375+02 | 2023-06-26 16:34:26.164375+02 | 2023-06-26 16:34:26.183272+02 | IPC | BufferIO | active | | 289208 | | SELECT +| parallel worker | | | | | | | | | | | | | | | | | | | | m.osm_id losmid, m.highway lhighway, q.highway as qhighway, q.maxspeed_class, +| | | | | | | | | | | | | | | | | | | | | case +| | | | | | | | | | | | | | | | | | | | | when q.highway in ('motorway', 'motorway_link','trunk','trunk_link') and q.maxspeed_class < 1.1 then+| | | | | | | | | | | | | | | | | | | | | .......... 105680476 | asia | 1197808 | 1192433 | 10 | postgres | psql | | | | 2023-06-26 16:34:26.169819+02 | 2023-06-26 16:34:26.164375+02 | 2023-06-26 16:34:26.164375+02 | 2023-06-26 16:34:26.183619+02 | IPC | BufferIO | active | | 289208 | | SELECT +| parallel worker | | | | | | | | | | | | | | | | | | | | m.osm_id losmid, m.highway lhighway, q.highway as qhighway, q.maxspeed_class, +| | | | | | | | | | | | | | | | | | | | | case +| | | | | | | | | | | | | | | | | | | | | when q.highway in ('motorway', 'motorway_link','trunk','trunk_link') and q.maxspeed_class < 1.1 then+| | | | | | | | | | | ............ 103790736 | europe | 1208383 | 1183475 | 10 | postgres | psql | | | | 2023-06-26 17:22:51.749632+02 | 2023-06-26 17:22:51.420038+02 | 2023-06-26 17:22:51.420038+02 | 2023-06-26 17:22:51.76465+02 | IPC | MessageQueueSend | active | | 289276 | | select xid , sum(water_river_see) as river_see +| parallel worker | | | | | | | | | | | | | | | | | | | | into table river_tmp +| | | | | | | | | | | | | | | | | | | | | from ( +| | | | | | | | | | | | | | | | | | | | | SELECT m.osm_id as xid, +| | | | | | | | | | | | | | | | | | | | | st_area(st_intersection(m.way, ST_Union( q.way))) / st_area(m.way) +| | | | | | | | | | | .............. 103790736 | europe | 1208384 | 1183475 | 10 | postgres | psql | | | | 2023-06-26 17:22:51.749853+02 | 2023-06-26 17:22:51.420038+02 | 2023-06-26 17:22:51.420038+02 | 2023-06-26 17:22:51.764351+02 | IPC | MessageQueueSend | active | | 289276 | | select xid , sum(water_river_see) as river_see +| parallel worker | | | | | | | | | | | | | | | | | | | | into table river_tmp +| | | | | | | | | | | | | | | | | | | | | from ( +| | | | | | | | | | | | | | | | | | | | | SELECT m.osm_id as xid, +| .............. free -h -w total used free shared buffers cache available Mem: 125Gi 35Gi 785Mi 50Gi 7.0Mi 89Gi 38Gi Swap: 4.0Gi 4.0Gi 0.0Ki top -n 1 -u postgres top - 18:55:27 up 22 days, 5:38, 1 user, load average: 3.65, 4.03, 4.14 Tasks: 425 total, 2 running, 423 sleeping, 0 stopped, 0 zombie %Cpu(s): 7.8 us, 2.4 sy, 0.0 ni, 87.5 id, 2.2 wa, 0.0 hi, 0.2 si, 0.0 st MiB Mem : 128504.4 total, 827.2 free, 36121.8 used, 91555.4 buff/cache MiB Swap: 4091.0 total, 0.0 free, 4091.0 used. 39010.5 avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 1183475 postgres 20 0 62.9g 60.3g 48.7g R 93.8 48.0 190:34.19 postgres 1192433 postgres 20 0 51.4g 38.3g 38.2g D 18.8 30.5 58:38.73 postgres 1197807 postgres 20 0 51.3g 18.7g 18.6g S 18.8 14.9 34:19.57 postgres 1197808 postgres 20 0 51.3g 18.9g 18.9g S 18.8 15.1 32:08.32 postgres 1169290 postgres 20 0 51.2g 877644 876996 S 0.0 0.7 0:00.64 postgres 1169293 postgres 20 0 51.2g 49.7g 49.7g S 0.0 39.6 3:46.63 postgres 1169294 postgres 20 0 51.2g 24.7g 24.7g S 0.0 19.7 0:15.49 postgres 1169295 postgres 20 0 51.2g 17900 17344 S 0.0 0.0 1:06.14 postgres 1169296 postgres 20 0 51.2g 4036 3200 S 0.0 0.0 0:00.12 postgres 1169297 postgres 20 0 73460 3028 2040 S 0.0 0.0 0:02.27 postgres 1169298 postgres 20 0 51.2g 3704 3036 S 0.0 0.0 0:00.00 postgres 1169387 postgres 20 0 2888 836 836 S 0.0 0.0 0:00.00 sh 1169388 postgres 20 0 2888 856 856 S 0.0 0.0 0:00.00 sh 1173213 postgres 20 0 2888 848 848 S 0.0 0.0 0:00.00 sh 1183474 postgres 20 0 21796 4872 4224 S 0.0 0.0 0:00.01 psql 1183699 postgres 20 0 8784 5248 3472 S 0.0 0.0 0:00.01 bash 1187783 postgres 20 0 2888 912 824 S 0.0 0.0 0:00.00 sh 1192431 postgres 20 0 21796 5476 4244 S 0.0 0.0 0:00.01 psql 1208383 postgres 20 0 61.3g 52.4g 42.3g S 0.0 41.8 2:20.11 postgres 1208384 postgres 20 0 61.3g 53.6g 43.6g S 0.0 42.7 2:21.45 postgres 1229692 postgres 20 0 7500 3884 3488 S 0.0 0.0 0:00.00 moni.sh 1229696 postgres 20 0 10896 4164 3300 R 0.0 0.0 0:00.00 top ps -ef postgres 1169290 1 0 14:20 ? 00:00:00 /usr/lib/postgresql/14/bin/postgres -D /var/lib/postgresql/14/main -c config_file=/etc/postgresql/14/main/postgresql.conf postgres 1169293 1169290 0 14:20 ? 00:03:46 postgres: 14/main: checkpointer postgres 1169294 1169290 0 14:20 ? 00:00:16 postgres: 14/main: background writer postgres 1169295 1169290 0 14:20 ? 00:01:06 postgres: 14/main: walwriter postgres 1169296 1169290 0 14:20 ? 00:00:00 postgres: 14/main: autovacuum launcher postgres 1169297 1169290 0 14:20 ? 00:00:02 postgres: 14/main: stats collector postgres 1169298 1169290 0 14:20 ? 00:00:00 postgres: 14/main: logical replication launcher postgres 1169387 1 0 14:20 ? 00:00:00 /bin/sh ./gen-europe.sh . postgres 1169388 1 0 14:20 ? 00:00:00 /bin/sh ./gen-rest.sh . postgres 1173213 1169387 0 14:35 ? 00:00:00 /bin/sh ./create-tags.sh europe postgres 1183474 1173213 0 15:21 ? 00:00:00 /usr/lib/postgresql/14/bin/psql -v ON_ERROR_STOP=1 -U postgres -d europe -a -f all.sql postgres 1183475 1169290 92 15:21 ? 05:25:22 postgres: 14/main: postgres europe [local] SELECT postgres 1187783 1169388 0 15:39 ? 00:00:00 /bin/sh ./create-tags.sh asia postgres 1192431 1187783 0 16:02 ? 00:00:00 /usr/lib/postgresql/14/bin/psql -v ON_ERROR_STOP=1 -U postgres -d asia -a -f all.sql postgres 1192433 1169290 29 16:02 ? 01:31:54 postgres: 14/main: postgres asia [local] SELECT postgres 1197807 1169290 24 16:34 ? 01:08:08 postgres: 14/main: parallel worker for PID 1192433 postgres 1197808 1169290 22 16:34 ? 01:02:24 postgres: 14/main: parallel worker for PID 1192433 postgres 1259514 1169290 34 21:08 ? 00:01:19 postgres: 14/main: parallel worker for PID 1183475 postgres 1259515 1169290 35 21:08 ? 00:01:20 postgres: 14/main: parallel worker for PID 1183475 Thank for any help !
pgsql-bugs by date:
Previous
From: Michael PaquierDate:
Subject: Re: BUG #17997: Assert failed in validatePartitionedIndex() when attaching partition index to child of valid index
Next
From: PG Bug reporting formDate:
Subject: BUG #18006: recovery_target_action=shutdown triggers automatic recovery on next startup (beyond point in time)