Re: RFC: Logging plan of the running query - Mailing list pgsql-hackers
From: torikoshia
Subject: Re: RFC: Logging plan of the running query
Msg-id: ab2099c9aa64690851b444b6ea562988@oss.nttdata.com
In response to: Re: RFC: Logging plan of the running query (Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>)
Responses: Re: RFC: Logging plan of the running query
List: pgsql-hackers
Hi Ashutosh,

On 2024-02-06 19:51, Ashutosh Bapat wrote:
> Thanks for the summary. It is helpful. I think patch is also getting
> better.
>
> I have a few questions and suggestions

Thanks for your comments.

> 1. Prologue of GetLockMethodLocalHash() mentions
>  * NOTE: When there are many entries in LockMethodLocalHash, calling this
>  * function and looking into all of them can lead to performance problems.
>  */
> How bad this performance could be. Let's assume that a query is taking
> time and pg_log_query_plan() is invoked to examine the plan of this
> query. Is it possible that the looping over all the locks itself takes
> a lot of time delaying the query execution further?

I think it depends on the number of local locks. Just for your information, I measured the CPU time of this page-lock check on my laptop by adding the code below on top of v27-0002-Testing-attempt-logging-plan-on-ever-CFI-call.patch[1], which calls ProcessLogQueryPlanInterrupt() in every CFI:

  diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
  index 5f7d77d567..65b7cb4925 100644
  --- a/src/backend/commands/explain.c
  +++ b/src/backend/commands/explain.c
  @@ -44,6 +44,8 @@
  +#include "time.h"
  ...
  @@ -5287,6 +5292,7 @@ ProcessLogQueryPlanInterrupt(void)
         * we check all the LocalLock entries and when finding even one, give up
         * logging the plan.
         */
  +     start = clock();
        hash_seq_init(&status, GetLockMethodLocalHash());
        while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
        {
                if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_PAGE)
                {
                        ereport(LOG_SERVER_ONLY,
                                errmsg("ignored request for logging query plan due to page lock conflicts"),
                                errdetail("You can try again in a moment."));
                        hash_seq_term(&status);
                        ProcessLogQueryPlanInterruptActive = false;
                        return;
                }
        }
  +     end = clock();
  +     cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;
  +
  +     ereport(LOG,
  +             errmsg("all locallock entry search took: %f", cpu_time_used));

There were about 3 million log lines which recorded the CPU time, and the durations were quite short:

  =# -- Extracted cpu_time_used from the log and loaded it into cpu_time.d.
  =# select max(d), min(d), avg(d) from cpu_time;
      max     | min |          avg
  -----------+-----+-----------------------
    0.000116 |   0 | 4.706274625332238e-07

I'm not certain this is valid for actual use cases, but these results seem to suggest that the scan will not take that long.

> 2. What happens if auto_explain is enabled in the backend and
> pg_log_query_plan() is called on the same backend? Will they conflict?
> I think we should add a test for the same.

Hmm, I think they don't conflict, since they only read the QueryDesc without modifying it and don't take locks on the same objects. (I imagine 'conflict' here means something like a 'hard conflict' in replication[2].) Indeed, using auto_explain and pg_log_query_plan() together outputs each log separately:

  (pid:62835)=# select pg_sleep(10);
  (pid:70000)=# select pg_log_query_plan(62835);
  (pid:70000)=# \! cat data/log/postgres.log
  ...
  2024-02-06 21:44:17.837 JST [62835:4:0] LOG:  00000: query plan running on backend with PID 62835 is:
          Query Text: select pg_sleep(10);
          Result  (cost=0.00..0.01 rows=1 width=4)
            Output: pg_sleep('10'::double precision)
          Query Identifier: 3506829283127886044
  2024-02-06 21:44:17.837 JST [62835:5:0] LOCATION:  ProcessLogQueryPlanInterrupt, explain.c:5336
  2024-02-06 21:44:26.974 JST [62835:6:0] LOG:  00000: duration: 10000.868 ms  plan:
          Query Text: select pg_sleep(10);
          Result  (cost=0.00..0.01 rows=1 width=4) (actual time=10000.802..10000.804 rows=1 loops=1)

> Using injection point support we should be able to add tests for
> testing pg_log_query_plan behaviour when there are page locks held or
> when auto_explain (with instrumentation) and pg_log_query_plan() work
> on the same query plan. Use injection point to make the backend
> running query wait at a suitable point to delay its execution and fire
> pg_log_query_plan() from other backend. May be the same test could
> examine the server log file to see if the plan is indeed output to the
> server log file.
>
> Given that the feature will be used when the things have already gone
> wrong, it should not make things more serious. So more testing and
> especially automated would help.

Thanks for the advice, it seems a good idea. I'm going to try to add tests using injection points.

[1] https://www.postgresql.org/message-id/CAAaqYe8LXVXQhYy3yT0QOHUymdM%3Duha0dJ0%3DBEPzVAx2nG1gsw%40mail.gmail.com
[2] https://www.postgresql.org/docs/devel/hot-standby.html#HOT-STANDBY-CONFLICT

--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation