Re: RFC: Logging plan of the running query - Mailing list pgsql-hackers
From | James Coleman
---|---
Subject | Re: RFC: Logging plan of the running query
Date |
Msg-id | CAAaqYe9Mse7JC78DgT97sXoB6U+E5OL3zJAYRSsgy-TPP9PXmw@mail.gmail.com
In response to | Re: RFC: Logging plan of the running query (James Coleman <jtc331@gmail.com>)
Responses | Re: RFC: Logging plan of the running query
List | pgsql-hackers
On Fri, Aug 25, 2023 at 7:43 AM James Coleman <jtc331@gmail.com> wrote:
>
> On Thu, Aug 17, 2023 at 10:02 AM torikoshia <torikoshia@oss.nttdata.com> wrote:
> >
> > On 2023-06-16 01:34, James Coleman wrote:
> > > Attached is v28, which sets ProcessLogQueryPlanInterruptActive to
> > > false in errfinish when necessary. Once built with those two patches
> > > I'm simply running `make check`.
> >
> > With the v28-0001 and v28-0002 patches, I confirmed that backend
> > processes consume a huge amount of memory, and under some environments
> > they were terminated by the OOM killer.
> >
> > This was because memory was allocated from existing memory contexts and
> > was not freed after ProcessLogQueryPlanInterrupt(). I updated the patch
> > to use a dedicated memory context for ProcessLogQueryPlanInterrupt().
> >
> > Applying the attached patch and the v28-0002 patch, `make check`
> > completed successfully after 20 minutes and 50 GB of logs on my
> > environment.
> >
> > >>> On 2023-06-15 01:48, James Coleman wrote:
> > >>> > The tests have been running since last night, but have apparently
> > >>> > been hung now for many hours.
> >
> > I don't know if this has anything to do with the hang you faced, but I
> > thought it might be possible that the large amount of memory usage
> > resulted in swapping, which caused a significant delay in processing.
>
> Ah, yes, I think that could be a possible explanation. I was delaying
> on this thread because I wasn't comfortable with having caused an
> issue once (even if I couldn't easily reproduce it) without at least
> some theory as to the cause (and a fix).
>
> > If possible, I would be very grateful if you could try to reproduce
> > this with the v29 patch.
>
> I'll kick off some testing.

I don't have time to investigate what's happening here, but 24 hours
later the first "make check" is still running, and at first glance it
seems to have the same behavior I'd seen that first time. The test
output is at this point:

# parallel group (5 tests): index_including create_view index_including_gist create_index create_index_spgist
ok 66 + create_index 26365 ms
ok 67 + create_index_spgist 27675 ms
ok 68 + create_view 1235 ms
ok 69 + index_including 1102 ms
ok 70 + index_including_gist 1633 ms
# parallel group (16 tests): create_aggregate create_cast errors roleattributes drop_if_exists hash_func typed_table create_am infinite_recurse

and it hasn't progressed past that point since at least ~16 hours ago
(I wasn't monitoring it for the first several hours of the run).

I haven't connected up gdb yet, and won't be able to until maybe
tomorrow, but here's the ps output for the postgres processes that are
running:

admin 3213249 0.0 0.0 196824 20552 ? Ss Aug25 0:00 /home/admin/postgresql-test/bin/postgres -D /home/admin/postgresql-test-data
admin 3213250 0.0 0.0 196964 7284 ? Ss Aug25 0:00 postgres: checkpointer
admin 3213251 0.0 0.0 196956 4276 ? Ss Aug25 0:00 postgres: background writer
admin 3213253 0.0 0.0 196956 8600 ? Ss Aug25 0:00 postgres: walwriter
admin 3213254 0.0 0.0 198424 7316 ? Ss Aug25 0:00 postgres: autovacuum launcher
admin 3213255 0.0 0.0 198412 5488 ? Ss Aug25 0:00 postgres: logical replication launcher
admin 3237967 0.0 0.0 2484 516 pts/4 S+ Aug25 0:00 /bin/sh -c echo "# +++ regress check in src/test/regress +++" && PATH="/home/admin/postgres/tmp_install/home/admin/postgresql-test/bin:/home/admin/postgres/src/test/regress:$PATH" LD_LIBRARY_PATH="/home/admin/postgres/tmp_install/home/admin/postgresql-test/lib" INITDB_TEMPLATE='/home/admin/postgres'/tmp_install/initdb-template ../../../src/test/regress/pg_regress --temp-instance=./tmp_check --inputdir=. --bindir= --dlpath=. --max-concurrent-tests=20 --schedule=./parallel_schedule
admin 3237973 0.0 0.0 197880 20688 pts/4 S+ Aug25 0:00 postgres -D /home/admin/postgres/src/test/regress/tmp_check/data -F -c listen_addresses= -k /tmp/pg_regress-7mmGUa
admin 3237976 0.0 0.1 198332 44608 ? Ss Aug25 0:00 postgres: checkpointer
admin 3237977 0.0 0.0 198032 4640 ? Ss Aug25 0:00 postgres: background writer
admin 3237979 0.0 0.0 197880 8580 ? Ss Aug25 0:00 postgres: walwriter
admin 3237980 0.0 0.0 199484 7608 ? Ss Aug25 0:00 postgres: autovacuum launcher
admin 3237981 0.0 0.0 199460 5488 ? Ss Aug25 0:00 postgres: logical replication launcher
admin 3243644 0.0 0.2 252400 74656 ? Ss Aug25 0:01 postgres: admin regression [local] SELECT waiting
admin 3243645 0.0 0.1 205480 33992 ? Ss Aug25 0:00 postgres: admin regression [local] SELECT waiting
admin 3243654 99.9 0.0 203156 31504 ? Rs Aug25 1437:49 postgres: admin regression [local] VACUUM
admin 3243655 0.0 0.1 212036 38504 ? Ss Aug25 0:00 postgres: admin regression [local] SELECT waiting
admin 3243656 0.0 0.0 206024 30892 ? Ss Aug25 0:00 postgres: admin regression [local] DELETE waiting
admin 3243657 0.0 0.1 205568 32232 ? Ss Aug25 0:00 postgres: admin regression [local] ALTER TABLE waiting
admin 3243658 0.0 0.0 203740 21532 ? Ss Aug25 0:00 postgres: admin regression [local] ANALYZE waiting
admin 3243798 0.0 0.0 199884 8464 ? Ss Aug25 0:00 postgres: autovacuum worker
admin 3244733 0.0 0.0 199492 5956 ? Ss Aug25 0:00 postgres: autovacuum worker
admin 3245652 0.0 0.0 199884 8468 ? Ss Aug25 0:00 postgres: autovacuum worker

As you can see, there are a bunch of backends presumably waiting, and
the VACUUM process has been pegging a single CPU core since at least
that ~16-hours-ago mark.

I hope to be able to do more investigation later, but I wanted to at
least give you this information now.

James Coleman
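For reference, the "dedicated memory context" fix torikoshia describes above
follows the usual backend pattern: do the plan-serialization work in a
private, short-lived context and delete it when done, so nothing accumulates
in whatever long-lived context happened to be current when the interrupt
arrived. A minimal sketch of that pattern follows; the memory-context calls
(AllocSetContextCreate, MemoryContextSwitchTo, MemoryContextDelete) are
PostgreSQL's standard API, but the surrounding function is illustrative and
not the actual v29 code:

    #include "postgres.h"
    #include "utils/memutils.h"

    static void
    LogQueryPlanSketch(void)
    {
        MemoryContext plan_cxt;
        MemoryContext old_cxt;

        /*
         * Build the EXPLAIN text in a private context rather than in
         * whatever CurrentMemoryContext happens to be, since that context
         * may live until the end of the query or longer.
         */
        plan_cxt = AllocSetContextCreate(CurrentMemoryContext,
                                         "log plan interrupt (sketch)",
                                         ALLOCSET_DEFAULT_SIZES);
        old_cxt = MemoryContextSwitchTo(plan_cxt);

        /* ... assemble and log the plan text here ... */

        MemoryContextSwitchTo(old_cxt);

        /*
         * Deleting the context frees every allocation made while it was
         * current, so repeated interrupts no longer accumulate memory.
         */
        MemoryContextDelete(plan_cxt);
    }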