Thread: On-demand running query plans using auto_explain and signals
On 2015-08-29 17:33:22 +0200, Shulgin, Oleksandr wrote:
> Probably using SIGUSR2 would be more appropriate, but I'm not sure if there
> are other extensions out there that might be already using it for some
> other reason (well, I do not know that for SIGUSR1 either). Looking at the
> current state of affairs in procsignal_sigusr1_handler() makes me believe
> it should be pretty safe to catch the signal like I do. Or is that not the
> case?
You can catch signals, but you're not allowed to do a lot from
them. Anything allocating memory, acquiring locks, etc. is out - these
functions aren't reentrant. If you can guarantee that you're not
interrupting any relevant code you can bend those rules, but that's
obviously not the case here.
Check out the list of async-signal-safe functions at http://man7.org/linux/man-pages/man7/signal.7.html

Andres
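A minimal sketch of the flag-and-latch pattern being discussed, assuming roughly 9.5-era APIs; the handler name, the flag and the processing function are placeholders, not taken from any patch in this thread:

    #include "postgres.h"
    #include "miscadmin.h"
    #include "storage/latch.h"

    static volatile sig_atomic_t got_cmdstatus_request = false;

    static void
    cmdstatus_signal_handler(SIGNAL_ARGS)
    {
        int         save_errno = errno;

        got_cmdstatus_request = true;   /* async-signal-safe: just set a flag */
        SetLatch(MyLatch);              /* wake the process if it is sleeping */

        errno = save_errno;
    }

    /* Called later from non-signal context, e.g. around CHECK_FOR_INTERRUPTS() */
    static void
    maybe_handle_cmdstatus_request(void)
    {
        if (got_cmdstatus_request)
        {
            got_cmdstatus_request = false;
            /* now it is safe to palloc, take locks, build EXPLAIN output, etc. */
        }
    }

The handler does nothing beyond recording the request and poking the latch; the actual work happens at the next safe point, which is exactly the constraint described above.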
On Sat, Aug 29, 2015 at 5:44 PM, Andres Freund <andres@anarazel.de> wrote:
On 2015-08-29 17:33:22 +0200, Shulgin, Oleksandr wrote:
> Probably using SIGUSR2 would be more appropriate, but I'm not sure if there
> are other extensions out there that might be already using it for some
> other reason (well, I do not know that for SIGUSR1 either). Looking at the
> current state of affairs in procsignal_sigusr1_handler() makes me believe
> it should be pretty safe to catch the signal like I do. Or is that not the
> case?
You can catch signals, but you're not allowed to do a lot from
them. Anything allocating memory, acquiring locks, etc. is out - these
functions aren't reentrant. If you can guarantee that you're not
interrupting any relevant code you can bend those rules, but that's
obviously not the case here.
Check out the list of async-signal-safe functions at http://man7.org/linux/man-pages/man7/signal.7.html

Good point. There's still hope to set a flag and process it later on. Will have to check if it's possible to stay in the scope of a loaded module though.
On 2015-08-29 18:27:59 +0200, Pavel Stehule wrote:
> 2015-08-29 18:25 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>
> > Good point. There's still hope to set a flag and process it later on.
> > Will have to check if it's possible to stay in the scope of a loaded module
> > though.
> I had a workable prototype - and It was implemented very similar as
> handling CANCEL
Where did you put the handling of that kind of interrupt? Directly into
ProcessInterrupts()?
Andres
On Aug 29, 2015 7:31 PM, "Pavel Stehule" <pavel.stehule@gmail.com> wrote:
>
>
>
> 2015-08-29 18:36 GMT+02:00 Andres Freund <andres@anarazel.de>:
>>
>> On 2015-08-29 18:27:59 +0200, Pavel Stehule wrote:
>> > 2015-08-29 18:25 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>
>> > > Good point. There's still hope to set a flag and process it later on.
>> > > Will have to check if it's possible to stay in the scope of a loaded module
>> > > though.
>>
>> > I had a workable prototype - and It was implemented very similar as
>> > handling CANCEL
>>
>> Where did you put the handling of that kind of interrupt? Directly into
>> ProcessInterrupts()?
>
>
> Probably. I don't remember it well, but it need hack code - it cannot be used from extension.

Do you still have the code somewhere around? Did it see production use?
Thanks!
--
Alex
2015-08-30 10:30 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>:
On Aug 29, 2015 7:31 PM, "Pavel Stehule" <pavel.stehule@gmail.com> wrote:
>
>
>
> 2015-08-29 18:36 GMT+02:00 Andres Freund <andres@anarazel.de>:
>>
>> On 2015-08-29 18:27:59 +0200, Pavel Stehule wrote:
>> > 2015-08-29 18:25 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>
>> > > Good point. There's still hope to set a flag and process it later on.
>> > > Will have to check if it's possible to stay in the scope of a loaded module
>> > > though.
>>
>> > I had a workable prototype - and It was implemented very similar as
>> > handling CANCEL
>>
>> Where did you put the handling of that kind of interrupt? Directly into
>> ProcessInterrupts()?
>
>
> Probably. I don't remember it well, but it need hack code - it cannot be used from extension.

Do you still have the code somewhere around? Did it see production use?
http://www.postgresql.org/message-id/CAFj8pRAXcS9B8ABgiM-zauVgGqDhPZOaRz5YSp1_Nhv9HP8nKw@mail.gmail.com
I am not sure I am able to find it - I'll try. We didn't use it on production.

Thanks!
--
Alex
Do you still have the code somewhere around? Did it see production use?
I sent it to mailing list year ago
http://www.postgresql.org/message-id/CAFj8pRAXcS9B8ABgiM-zauVgGqDhPZOaRz5YSp1_Nhv9HP8nKw@mail.gmail.com
Do you still have the code somewhere around? Did it see production use?
I sent it to mailing list year ago
http://www.postgresql.org/message-id/CAFj8pRAXcS9B8ABgiM-zauVgGqDhPZOaRz5YSp1_Nhv9HP8nKw@mail.gmail.com

Ah, thanks! Somehow I've missed this mail. You didn't add the patch to a commitfest back then I think?
--Alex
Ah, thanks! Somehow I've missed this mail. You didn't add the patch to a commitfest back then I think?

I had no time to finish this patch - there is few issues in signal handling and returning back result - but still I want it :) - and what I know - almost all other SQL db has similar functionality.
On Mon, Aug 31, 2015 at 12:35 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:

Ah, thanks! Somehow I've missed this mail. You didn't add the patch to a commitfest back then I think?

I had no time to finish this patch - there is few issues in signal handling and returning back result - but still I want it :) - and what I know - almost all other SQL db has similar functionality.

I've updated the patch for the current master and also added some unexpected parameters handling, so attached is a v2.
I'd say we should hide the so-designed pg_cmdstatus() interface behind more friendly calls like pg_explain_backend() and pg_backend_progress() to give some naming examples, to remove the need for magic numbers in the second arg.
What I've found missing in this approach is the insight into nested executor runs, so that if you're running a "SELECT my_func()", you only see this outer query in the pg_cmdstatus() output. With the auto_explain approach, by hooking into executor I was able to capture the nested queries and their plans as well.
It's conceptually trivial to add some code to use the Executor hooks here, but I don't see any precedent for this except for contrib modules (auto_explain and pg_stat_statements), I'm just not sure if that would be OK-ish.

And when we solve that, there is another problem of having a sane interface to query the nested plans. For a psql user, probably the most interesting would be the topmost (level=1) and the innermost (e.g. level=-1) plans. We might also want to provide a full nesting of plans in a structured format like JSON or... *cough* XML, for programs to consume and display nicely with folding and stuff.

And the most interesting would be making instrumentation work with all of the above.
I'm adding this to the next CF.

--
Alex
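A minimal sketch of the executor-hook idea mentioned above, written extension-style for brevity (the real code would live in core, not an extension); it only maintains a stack of the QueryDescs of nested executor runs, which a later EXPLAIN-style request could walk. All names are illustrative and error-path cleanup is omitted:

    #include "postgres.h"
    #include "executor/executor.h"

    #define MAX_QUERY_DEPTH 64              /* arbitrary cap for this sketch */

    static ExecutorStart_hook_type prev_ExecutorStart = NULL;
    static ExecutorEnd_hook_type prev_ExecutorEnd = NULL;

    static QueryDesc *current_query_stack[MAX_QUERY_DEPTH];
    static int  query_stack_depth = 0;

    static void
    stack_ExecutorStart(QueryDesc *queryDesc, int eflags)
    {
        if (prev_ExecutorStart)
            prev_ExecutorStart(queryDesc, eflags);
        else
            standard_ExecutorStart(queryDesc, eflags);

        /* push: nested runs (e.g. queries inside PL functions) land on top */
        if (query_stack_depth < MAX_QUERY_DEPTH)
            current_query_stack[query_stack_depth] = queryDesc;
        query_stack_depth++;
    }

    static void
    stack_ExecutorEnd(QueryDesc *queryDesc)
    {
        /* pop; a real implementation also needs cleanup on error/abort */
        if (query_stack_depth > 0)
            query_stack_depth--;

        if (prev_ExecutorEnd)
            prev_ExecutorEnd(queryDesc);
        else
            standard_ExecutorEnd(queryDesc);
    }

    void
    _PG_init(void)
    {
        prev_ExecutorStart = ExecutorStart_hook;
        ExecutorStart_hook = stack_ExecutorStart;
        prev_ExecutorEnd = ExecutorEnd_hook;
        ExecutorEnd_hook = stack_ExecutorEnd;
    }

With such a stack, entry 0 would correspond to the topmost (level=1) query and entry query_stack_depth - 1 to the innermost one.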
I'd say we should hide the so-designed pg_cmdstatus() interface behind more friendly calls like pg_explain_backend() and pg_backend_progress() to give some naming examples, to remove the need for magic numbers in the second arg.

I had a similar idea - this is good enough for a start, but the target interface is based on integration with the EXPLAIN statement, something like EXPLAIN PROCESS or EXPLAIN PID or EXPLAIN VERBOSE PID ..
the important functionality is drawing the complete text of the query - it was my original motivation, because I had no way to get the complete query before it finished

Probably the communication between processes should be more complex :( - the SHM queue should be used there, because some plans can be terribly long. Using a shared write buffer (one for all) is probably too simple a solution - good for a prototype, but not good for core.

I have an idea about the communication:

1. caller prepares a buffer and a shm queue and signalizes the target process - the parameter is the pid of the caller
2. target process fills the write buffer and closes the queue
3. caller shows the data and closes the buffer, closes the queue

Now almost all code for the communication is in upstream - the missing part is injecting one end of the queue into any process dynamically.
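A rough sketch of that three-step exchange using the shm_mq facility, assuming 9.5-era signatures (e.g. shm_mq_detach() later changed to take a handle); PROCSIG_CMDSTATUS_INFO, the buffer size and the function names are placeholders, not existing symbols:

    #include "postgres.h"
    #include "miscadmin.h"
    #include "storage/proc.h"
    #include "storage/procsignal.h"
    #include "storage/shm_mq.h"

    #define CMDSTATUS_BUFFER_SIZE 8192      /* placeholder size */

    /* Step 1: caller side -- lay out the queue and poke the target backend. */
    static shm_mq *
    cmdstatus_request(void *shared_buffer, pid_t target_pid)
    {
        shm_mq     *mq = shm_mq_create(shared_buffer, CMDSTATUS_BUFFER_SIZE);

        shm_mq_set_receiver(mq, MyProc);    /* the caller will read the reply */
        /* PROCSIG_CMDSTATUS_INFO is hypothetical; a new reason would be needed */
        SendProcSignal(target_pid, PROCSIG_CMDSTATUS_INFO, InvalidBackendId);
        return mq;
    }

    /* Step 2: target side -- runs later, from a safe interrupt-processing point. */
    static void
    cmdstatus_answer(shm_mq *mq, const char *plan_text)
    {
        shm_mq_handle *outq;

        shm_mq_set_sender(mq, MyProc);
        outq = shm_mq_attach(mq, NULL, NULL);
        shm_mq_send(outq, strlen(plan_text) + 1, plan_text, false);
        shm_mq_detach(mq);                  /* tells the reader we are done */
    }

    /* Step 3: caller side again -- read until the sender detaches. */
    static void
    cmdstatus_collect(shm_mq *mq)
    {
        shm_mq_handle *inq = shm_mq_attach(mq, NULL, NULL);
        Size        len;
        void       *data;

        while (shm_mq_receive(inq, &len, &data, false) == SHM_MQ_SUCCESS)
            elog(NOTICE, "%s", (char *) data);      /* or accumulate the result */
    }

How the target process learns where the shared buffer lives (the "injection" problem mentioned above) is exactly the part this sketch glosses over.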
I'd say we should hide the so-designed pg_cmdstatus() interface behind more friendly calls like pg_explain_backend() and pg_backend_progress() to give some naming examples, to remove the need for magic numbers in the second arg.

I had a similar idea - this is good enough for a start, but the target interface is based on integration with the EXPLAIN statement, something like EXPLAIN PROCESS or EXPLAIN PID or EXPLAIN VERBOSE PID ..

Yes, that's another way to do it.

the important functionality is drawing the complete text of the query - it was my original motivation, because I had no way to get the complete query before it finished

Probably the communication between processes should be more complex :( - the SHM queue should be used there, because some plans can be terribly long. Using a shared write buffer (one for all) is probably too simple a solution - good for a prototype, but not good for core. I have an idea about the communication: 1. caller prepares a buffer and a shm queue and signalizes the target process - the parameter is the pid of the caller; 2. target process fills the write buffer and closes the queue; 3. caller shows the data and closes the buffer, closes the queue. Now almost all code for the communication is in upstream - the missing part is injecting one end of the queue into any process dynamically.

I'm not familiar with the shared memory handling, but could we not allocate just enough shared memory to fit the data we're going to write instead of the fixed 8k? It's not that we cannot know the length of the resulting plan text in advance.
I think we can remove buffer_is_free/buffer_holds_data and just use the buffer != NULL to check if there's some data to be read (and buffer == NULL to check if we can write).

--
Alex
I'm not familiar with the shared memory handling, but could we not allocate just enough shared memory to fit the data we're going to write instead of the fixed 8k? It's not that we cannot know the length of the resulting plan text in advance.

the shared memory cannot be reused - (released) :(, so allocating enough memory is not effective. More - in this moment it has not sense. Shared memory queue can do almost all work.
I'm not familiar with the shared memory handling, but could we not allocate just enough shared memory to fit the data we're going to write instead of the fixed 8k? It's not that we cannot know the length of the resulting plan text in advance.

the shared memory cannot be reused - (released) :(, so allocating enough memory is not effective. More - in this moment it has not sense. Shared memory queue can do almost all work.

A-ha, I've discovered the shared memory message queue facility and I see how we can use it.

But do we really need the slots mechanism? Would it not be OK to just let the LWLock do the sequencing of concurrent requests? Given that we only going to use one message queue per cluster, there's not much concurrency you can gain by introducing slots I believe.
--Alex
But do we really need the slots mechanism? Would it not be OK to just let the LWLock do the sequencing of concurrent requests? Given that we only going to use one message queue per cluster, there's not much concurrency you can gain by introducing slots I believe.

I afraid of problems on production. When you have a queue related to any process, then all problems should be off after end of processes. One message queue per cluster needs restart cluster when some pathological problems are - and you cannot restart cluster in production week, sometimes weeks. The slots are more robust.
On Tue, Sep 1, 2015 at 7:02 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:

But do we really need the slots mechanism? Would it not be OK to just let the LWLock do the sequencing of concurrent requests? Given that we only going to use one message queue per cluster, there's not much concurrency you can gain by introducing slots I believe.

I afraid of problems on production. When you have a queue related to any process, then all problems should be off after end of processes. One message queue per cluster needs restart cluster when some pathological problems are - and you cannot restart cluster in production week, sometimes weeks. The slots are more robust.

Yes, but in your implementation the slots themselves don't have a queue/buffer. Did you intend to have a message queue per slot?
What sort of pathological problems are you concerned of? The communicating backends should just detach from the message queue properly and have some timeout configured to prevent deadlocks. Other than that, I don't see how having N slots really help the problem: in case of pathological problems you will just deplete them all sooner or later.
--Alex
2015-09-02 11:01 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>:
On Tue, Sep 1, 2015 at 7:02 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:

But do we really need the slots mechanism? Would it not be OK to just let the LWLock do the sequencing of concurrent requests? Given that we only going to use one message queue per cluster, there's not much concurrency you can gain by introducing slots I believe.

I afraid of problems on production. When you have a queue related to any process, then all problems should be off after end of processes. One message queue per cluster needs restart cluster when some pathological problems are - and you cannot restart cluster in production week, sometimes weeks. The slots are more robust.

Yes, but in your implementation the slots themselves don't have a queue/buffer. Did you intend to have a message queue per slot?

The message queue cannot be reused, so I expect one slot per caller to be used passing parameters, - message queue will be created/released by demand by caller.
What sort of pathological problems are you concerned of? The communicating backends should just detach from the message queue properly and have some timeout configured to prevent deadlocks. Other than that, I don't see how having N slots really help the problem: in case of pathological problems you will just deplete them all sooner or later.

I afraid of unexpected problems :) - any part of signal handling or multiprocess communication is fragile. Slots are simple and simply attached to any process without necessity to alloc/free some memory.
On Wed, Sep 2, 2015 at 11:16 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
2015-09-02 11:01 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>:
On Tue, Sep 1, 2015 at 7:02 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:

But do we really need the slots mechanism? Would it not be OK to just let the LWLock do the sequencing of concurrent requests? Given that we only going to use one message queue per cluster, there's not much concurrency you can gain by introducing slots I believe.

I afraid of problems on production. When you have a queue related to any process, then all problems should be off after end of processes. One message queue per cluster needs restart cluster when some pathological problems are - and you cannot restart cluster in production week, sometimes weeks. The slots are more robust.

Yes, but in your implementation the slots themselves don't have a queue/buffer. Did you intend to have a message queue per slot?

The message queue cannot be reused, so I expect one slot per caller to be used passing parameters, - message queue will be created/released by demand by caller.

I don't believe a message queue cannot really be reused. What would stop us from calling shm_mq_create() on the queue struct again?
To give you an idea, in my current prototype I have only the following struct:

    typedef struct
    {
        LWLock     *lock;
        /* CmdStatusInfoSlot slots[CMDINFO_SLOTS]; */
        pid_t       target_pid;
        pid_t       sender_pid;
        int         request_type;
        int         result_code;
        shm_mq      buffer;
    } CmdStatusInfo;

An instance of this is allocated in shared memory once, using a BUFFER_SIZE of 8k.

In pg_cmdstatus() I take the LWLock to check if target_pid is 0; then it means nobody else is using this communication channel at the moment. If that's the case, I set the pids and request_type and initialize the mq buffer. Otherwise I just sleep and retry acquiring the lock (a timeout should be added here probably).

What sort of pathological problems are you concerned of? The communicating backends should just detach from the message queue properly and have some timeout configured to prevent deadlocks. Other than that, I don't see how having N slots really help the problem: in case of pathological problems you will just deplete them all sooner or later.

I afraid of unexpected problems :) - any part of signal handling or multiprocess communication is fragile. Slots are simple and simply attached to any process without necessity to alloc/free some memory.

Yes, but do slots solve the actual problem? If there is only one message queue, you still have the same problem regardless of the number of slots you decide to have.

--
Alex
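A sketch of the claim-the-channel step described just above, using the CmdStatusInfo struct from the quoted prototype; everything here is illustrative rather than the actual patch code:

    #include "postgres.h"
    #include "miscadmin.h"
    #include "storage/lwlock.h"

    static void
    claim_cmdstatus_channel(CmdStatusInfo *info, pid_t target_pid, int request_type)
    {
        for (;;)
        {
            bool        claimed = false;

            LWLockAcquire(info->lock, LW_EXCLUSIVE);
            if (info->target_pid == 0)      /* nobody is using the channel */
            {
                info->target_pid = target_pid;
                info->sender_pid = MyProcPid;
                info->request_type = request_type;
                /* (re)initialize the message queue over the shared buffer here */
                claimed = true;
            }
            LWLockRelease(info->lock);

            if (claimed)
                break;

            CHECK_FOR_INTERRUPTS();         /* let the user cancel the wait */
            pg_usleep(1000L);               /* channel busy: back off and retry */
        }
    }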
2015-09-02 12:36 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>:
On Wed, Sep 2, 2015 at 11:16 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
2015-09-02 11:01 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>:
On Tue, Sep 1, 2015 at 7:02 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:

But do we really need the slots mechanism? Would it not be OK to just let the LWLock do the sequencing of concurrent requests? Given that we only going to use one message queue per cluster, there's not much concurrency you can gain by introducing slots I believe.

I afraid of problems on production. When you have a queue related to any process, then all problems should be off after end of processes. One message queue per cluster needs restart cluster when some pathological problems are - and you cannot restart cluster in production week, sometimes weeks. The slots are more robust.

Yes, but in your implementation the slots themselves don't have a queue/buffer. Did you intend to have a message queue per slot?

The message queue cannot be reused, so I expect one slot per caller to be used passing parameters, - message queue will be created/released by demand by caller.

I don't believe a message queue cannot really be reused. What would stop us from calling shm_mq_create() on the queue struct again?

you cannot to change recipient later
On Wed, Sep 2, 2015 at 2:56 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
2015-09-02 12:36 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>:
On Wed, Sep 2, 2015 at 11:16 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
2015-09-02 11:01 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>:
On Tue, Sep 1, 2015 at 7:02 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:

But do we really need the slots mechanism? Would it not be OK to just let the LWLock do the sequencing of concurrent requests? Given that we only going to use one message queue per cluster, there's not much concurrency you can gain by introducing slots I believe.

I afraid of problems on production. When you have a queue related to any process, then all problems should be off after end of processes. One message queue per cluster needs restart cluster when some pathological problems are - and you cannot restart cluster in production week, sometimes weeks. The slots are more robust.

Yes, but in your implementation the slots themselves don't have a queue/buffer. Did you intend to have a message queue per slot?

The message queue cannot be reused, so I expect one slot per caller to be used passing parameters, - message queue will be created/released by demand by caller.

I don't believe a message queue cannot really be reused. What would stop us from calling shm_mq_create() on the queue struct again?

you cannot to change recipient later

Well, maybe I'm missing something, but sh_mq_create() will just overwrite the contents of the struct, so it doesn't care about sender/receiver: only sh_mq_set_sender/receiver() do.
Well, maybe I'm missing something, but sh_mq_create() will just overwrite the contents of the struct, so it doesn't care about sender/receiver: only sh_mq_set_sender/receiver() do.

if you create sh_mq from scratch, then you can reuse structure.
On Wed, Sep 2, 2015 at 3:04 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:

Well, maybe I'm missing something, but sh_mq_create() will just overwrite the contents of the struct, so it doesn't care about sender/receiver: only sh_mq_set_sender/receiver() do.

if you create sh_mq from scratch, then you can reuse structure.
On Wed, Sep 2, 2015 at 3:07 PM, Shulgin, Oleksandr <oleksandr.shulgin@zalando.de> wrote:
On Wed, Sep 2, 2015 at 3:04 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:

Well, maybe I'm missing something, but sh_mq_create() will just overwrite the contents of the struct, so it doesn't care about sender/receiver: only sh_mq_set_sender/receiver() do.

if you create sh_mq from scratch, then you can reuse structure.

Please find attached a v3.

It uses a shared memory queue and also has the ability to capture plans nested deeply in the call stack. Not sure about using the executor hook, since this is not an extension...

The LWLock is used around initializing/cleaning the shared struct and the message queue, the IO synchronization is handled by the message queue itself. After some testing with concurrent pgbench and intentionally deep recursive plpgsql functions (up to 700 plpgsql stack frames) I think this approach can work. Unless there's some theoretical problem I'm just not aware of. :-)
Comments welcome!
--Alex
Hi

2015-09-03 18:30 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>:
On Wed, Sep 2, 2015 at 3:07 PM, Shulgin, Oleksandr <oleksandr.shulgin@zalando.de> wrote:
On Wed, Sep 2, 2015 at 3:04 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:

Well, maybe I'm missing something, but sh_mq_create() will just overwrite the contents of the struct, so it doesn't care about sender/receiver: only sh_mq_set_sender/receiver() do.

if you create sh_mq from scratch, then you can reuse structure.

Please find attached a v3.

It uses a shared memory queue and also has the ability to capture plans nested deeply in the call stack. Not sure about using the executor hook, since this is not an extension...

The LWLock is used around initializing/cleaning the shared struct and the message queue, the IO synchronization is handled by the message queue itself. After some testing with concurrent pgbench and intentionally deep recursive plpgsql functions (up to 700 plpgsql stack frames) I think this approach can work. Unless there's some theoretical problem I'm just not aware of. :-)

Comments welcome!

I am not pretty happy from this design. Only one EXPLAIN PID/GET STATUS in one time can be executed per server - I remember lot of queries that doesn't handle CANCEL well ~ doesn't handle interrupt well, and this can be unfriendly. Cannot to say if it is good enough for first iteration. This is functionality that can be used for diagnostic when you have overloaded server and this risk looks too high (for me). The idea of receive slot can to solve this risk well (and can be used elsewhere). The difference from this code should not be too big - although it is not trivial - needs work with PGPROC. The opinion of our multiprocess experts can be interesting. Maybe I am too careful.
Other smaller issues:

* probably sending line by line is useless - shm_mq_send can pass bigger data when nowait = false
* pg_usleep(1000L); - it is related to single point resource

Some ideas:

* this code share some important parts with auto_explain (query stack) - and because it should be in core (due handling signal if I remember well), it can be first step of integration auto_explain to core.
On Sep 3, 2015 10:14 PM, "Pavel Stehule" <pavel.stehule@gmail.com> wrote:
>>>
>>> Please find attached a v3.
>>>
>>> It uses a shared memory queue and also has the ability to capture plans nested deeply in the call stack. Not sure about using the executor hook, since this is not an extension...
>>>
>>> The LWLock is used around initializing/cleaning the shared struct and the message queue, the IO synchronization is handled by the message queue itself.
>>
>> I am not pretty happy from this design. Only one EXPLAIN PID/GET STATUS in one time can be executed per server - I remember lot of queries that doesn't handle CANCEL well ~ doesn't handle interrupt well, and this can be unfriendly. Cannot to say if it is good enough for first iteration. This is functionality that can be used for diagnostic when you have overloaded server and this risk looks too high (for me). The idea of receive slot can to solve this risk well (and can be used elsewhere). The difference from this code should not be too big - although it is not trivial - needs work with PGPROC. The opinion of our multiprocess experts can be interesting. Maybe I am too careful.

Sorry, but I still don't see how the slots help this issue - could you please elaborate?
>> Other smaller issues:
>>
>> * probably sending line by line is useless - shm_mq_send can pass bigger data when nowait = false

I'm not sending it like that because of the message size - I just find it more convenient. If you think it can be problematic, it's easy to do this as before, by splitting lines on the receiving side.
>> * pg_usleep(1000L); - it is related to single point resource
But not a highly concurrent one.
-
Alex
Sorry, but I still don't see how the slots help this issue - could you please elaborate?
with slot (or some similar) there is not global locked resource. If I'll have a time at weekend I'll try to write some prototype.
>> Other smaller issues:
>>
>> * probably sending line by line is useless - shm_mq_send can pass bigger data when nowait = false

I'm not sending it like that because of the message size - I just find it more convenient. If you think it can be problematic, it's easy to do this as before, by splitting lines on the receiving side.
Yes, shm queue sending data immediately - so slicing on sender generates more interprocess communication
>> * pg_usleep(1000L); - it is related to single point resource
But not a highly concurrent one.
I believe it is not necessary - waiting (sleeping) can be deeper in reading from queue - the code will be cleaner
On Fri, Sep 4, 2015 at 6:11 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:

Sorry, but I still don't see how the slots help this issue - could you please elaborate?
with slot (or some similar) there is not global locked resource. If I'll have a time at weekend I'll try to write some prototype.

But you will still lock on the slots list to find an unused one. How is that substantially different from what I'm doing?
>> Other smaller issues:
>>
>> * probably sending line by line is useless - shm_mq_send can pass bigger data when nowait = false

I'm not sending it like that because of the message size - I just find it more convenient. If you think it can be problematic, it's easy to do this as before, by splitting lines on the receiving side.
Yes, shm queue sending data immediately - so slicing on sender generates more interprocess communication

Well, we are talking about hundreds to thousands bytes per plan in total. And if my reading of shm_mq implementation is correct, if the message fits into the shared memory buffer, the receiver gets the direct pointer to the shared memory, no extra allocation/copy to process-local memory. So this can be actually a win.
>> * pg_usleep(1000L); - it is related to single point resource
But not a highly concurrent one.
I believe it is not necessary - waiting (sleeping) can be deeper in reading from queue - the code will be cleaner

The only way I expect this line to be reached is when a concurrent pg_cmdstatus() call is in progress: the receiving backend has set the target_pid and has created the queue, released the lock and now waits to read something from shm_mq. So the backend that's trying to also use this communication channel can obtain the lwlock, checks if the channel is not used at the time, fails and then it needs to check again, but that's going to put a load on the CPU, so there's a small sleep.

The real problem could be if the process that was signaled to connect to the message queue never handles the interrupt, and we keep waiting forever in shm_mq_receive(). We could add a timeout parameter or just let the user cancel the call: send a cancellation request, use pg_cancel_backend() or set statement_timeout before running this.
--Alex
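One way the timeout idea could look, sketched with the 9.5-era WaitLatch() signature (later releases add a wait-event argument); the function name is made up and error handling is kept minimal:

    #include "postgres.h"
    #include "miscadmin.h"
    #include "storage/latch.h"
    #include "storage/shm_mq.h"

    /* Read one message, giving up after timeout_ms instead of blocking forever.
     * (A real version would also track total elapsed time across wakeups.) */
    static bool
    receive_with_timeout(shm_mq_handle *mqh, long timeout_ms,
                         Size *nbytes, void **data)
    {
        for (;;)
        {
            shm_mq_result res;
            int         rc;

            res = shm_mq_receive(mqh, nbytes, data, true /* nowait */);
            if (res == SHM_MQ_SUCCESS)
                return true;
            if (res == SHM_MQ_DETACHED)
                return false;           /* sender went away */

            /* SHM_MQ_WOULD_BLOCK: sleep until woken or the timeout expires */
            rc = WaitLatch(MyLatch,
                           WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                           timeout_ms);
            ResetLatch(MyLatch);

            if (rc & WL_TIMEOUT)
                return false;           /* target never answered in time */

            CHECK_FOR_INTERRUPTS();     /* honour cancel/statement_timeout */
        }
    }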
>>
>> But you will still lock on the slots list to find an unused one. How is that substantially different from what I'm doing?
>
> It is not necessary - you can use similar technique to what it does PGPROC. I am sending "lock free" demo.
>
> I don't afraid about locks - short locks, when the range and time are limited. But there are lot of bugs, and fixes with the name "do interruptible some", and it is reason, why I prefer typical design for work with shared memory.
Thanks, this is really helpful! The key difference is that every backend has a dedicated slot, so there's no need to search for a free one, which would again incur locking.
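A sketch of that per-backend slot layout: the shared array is sized by MaxBackends and each backend simply addresses its own entry by MyBackendId, so no search or lock is needed to find a slot. Struct and function names are illustrative, and the usual shmem-request/startup-hook plumbing is omitted:

    #include "postgres.h"
    #include "miscadmin.h"
    #include "storage/backendid.h"
    #include "storage/shmem.h"

    typedef struct CmdStatusSlot
    {
        pid_t       css_pid;            /* owning backend, for sanity checks */
        int         request_type;
        bool        is_processed;
        char        buffer[8192];       /* or a shm_mq could live here instead */
    } CmdStatusSlot;

    static CmdStatusSlot *CmdStatusSlots;   /* shared array, MaxBackends entries */
    static CmdStatusSlot *MyCmdStatusSlot;

    /* Run once at shared-memory initialization time. */
    static void
    cmdstatus_shmem_startup(void)
    {
        bool        found;

        CmdStatusSlots = ShmemInitStruct("cmdstatus slots",
                                         mul_size(sizeof(CmdStatusSlot), MaxBackends),
                                         &found);
        if (!found)
            MemSet(CmdStatusSlots, 0, mul_size(sizeof(CmdStatusSlot), MaxBackends));
    }

    /* Run once per backend, after MyBackendId is assigned. */
    static void
    cmdstatus_backend_init(void)
    {
        /* each backend owns exactly the slot at its own BackendId: no search */
        MyCmdStatusSlot = &CmdStatusSlots[MyBackendId - 1];
        MyCmdStatusSlot->css_pid = MyProcPid;
    }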
>> Well, we are talking about hundreds to thousands bytes per plan in total. And if my reading of shm_mq implementation is correct, if the message fits into the shared memory buffer, the receiver gets the direct pointer to the shared memory, no extra allocation/copy to process-local memory. So this can be actually a win.
>
> you have to calculate with signals and interprocess communication. the cost of memory allocation is not all.
Sure. Anyway, we're talking about only kilobytes being sent in this case, so the whole performance discussion is rather moot.
>> The real problem could be if the process that was signaled to connect to the message queue never handles the interrupt, and we keep waiting forever in shm_mq_receive(). We could add a timeout parameter or just let the user cancel the call: send a cancellation request, use pg_cancel_backend() or set statement_timeout before running this.
>
> This is valid question - for begin we can use a statement_timeout and we don't need to design some special (if you don't hold some important lock).
> My example (the code has prototype quality) is little bit longer, but it work without global lock - the requester doesn't block any other

I'll update the commitfest patch to use this technique.
On Tue, Sep 8, 2015 at 11:49 AM, Shulgin, Oleksandr <oleksandr.shulgin@zalando.de> wrote:
>> The real problem could be if the process that was signaled to connect to the message queue never handles the interrupt, and we keep waiting forever in shm_mq_receive(). We could add a timeout parameter or just let the user cancel the call: send a cancellation request, use pg_cancel_backend() or set statement_timeout before running this.
>
> This is valid question - for begin we can use a statement_timeout and we don't need to design some special (if you don't hold some important lock).
> My example (the code has prototype quality) is little bit longer, but it work without global lock - the requester doesn't block any other

I'll update the commitfest patch to use this technique.

Please find attached v4.
--Alex
Please find attached v4.

It is better
Two notices:

1. The communication mechanism can be used more widely than only for this purpose. We can introduce a SendInfoHook - and it can be used for any custom probes - memory, cpu, ...
2. With your support for explain of nested queries we have all what we need for integration auto_explain to core.
Two notices:

1. The communication mechanism can be used more widely than only for this purpose. We can introduce a SendInfoHook - and it can be used for any custom probes - memory, cpu, ...
So there's certainly a space for generalization. Rename it as pg_backend_info()?
2. With your support for explain of nested queries we have all what we need for integration auto_explain to core.

Well, I don't quite see how that follows. What I'm really after is bringing instrumentation support to this, so that not only the plan could be extracted, but also the rows/loops/times accumulated thus far during the query run.
--Alex
On Wed, Sep 9, 2015 at 8:36 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:

Please find attached v4.

It is better

One important thing to notice, and it probably deserves a code comment: any modification of the slot fields done by the info sender backend must be done before actually sending the message via the shared memory queue (or detaching it, if there's nothing to be sent). Otherwise it is possible that the is_processed flag will be set on a slot that has already been reused by the receiving backend.
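A tiny sketch of that ordering rule (names illustrative, reusing the slot fields sketched earlier): the sending backend finishes every slot update first, because the shm_mq send (or detach) is what lets the requester run again and possibly reuse the slot:

    static void
    send_info_and_finish(CmdStatusSlot *slot, shm_mq_handle *mqh,
                         const char *payload, Size len)
    {
        /* 1. update the slot fields first ... */
        slot->is_processed = true;

        /*
         * 2. ... and only then send (or detach, if there is nothing to send).
         * Once this returns, the receiving backend may reset and reuse the
         * slot at any moment, so no further slot writes are allowed here.
         */
        shm_mq_send(mqh, len, payload, false);
    }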
Hi,
I did a quick initial review of this patch today, so here are my comments so far:
- ipcs.c should include utils/cmdstatus.h (the compiler complains
about implicit declaration of two functions)
- Attempts to get plan for simple insert queries like this
INSERT INTO x SELECT * FROM x;
end with a segfault, because ActivePortal->queryDesc is 0x0 for this
query. Needs investigation.
- The lockless approach seems fine to me, although I think the fear
of performance issues is a bit moot (I don't think we expect large
number of processes calling pg_cmdstatus at the same time). But
it's not significantly more complex, so why not.
- The patch contains pretty much no documentation, both comments
at the code level and user docs. The lack of user docs is not that
a big deal at this point (although the patch seems to be mature
enough, although the user-level API will likely change).
The lack of code comments is more serious, as it makes the review
somewhat more difficult. For example it'd be very nice to document
the contract for the lock-less interface.
- I agree that pg_cmdstatus() is not the best API. Having something
like EXPLAIN PID would be nice, but it does not really work for
all the request types (just CMD_STATUS_REQUEST_EXPLAIN). Maybe
there's not a single API for all cases, i.e. we should use EXPLAIN
PID for one case and invent something different for the other?
- Is there a particular reason why we allocate slots for auxiliary
processes and not just for backends (NumBackends)? Do we expect those
auxiliary processes to ever use this API?
- CleanupCmdStatusSlot seems needlessly complicated. I don't quite see
the need for the second argument, or the internal slot variable. Why
not to simply use the MyCmdStatusSlot directly?
- I also don't quite understand why we need to track css_pid for the
slot? In what scenario will this actually matter?
- While being able to get EXPLAIN from the running process is nice,
I'm especially interested in getting EXPLAIN ANALYZE to get insight
into the progress of the execution. There are other ways to get the
EXPLAIN, e.g. by opening a different connection and actually running
it (sure, the plan might have changed since then), but currently
there's no way to get insight into the progress.
From the thread I get the impression that Oleksandr also finds this
useful - correct? What are the plans in this direction?
ISTM we need at least two things for that to work:
(a) Ability to enable instrumentation on all queries (effectively
what auto_explain allows), otherwise we can't get EXPLAIN ANALYZE
on the queries later. But auto_explain is an extension, so that
does not seem as a good match if this is supposed to be in core.
In that case a separate GUC seems appropriate.
(b) Being able to run the InstrEnd* methods repeatedly - the initial
message in this thread mentions issues with InstrEndLoop for
example. So perhaps this is non-trivial.
- And finally, I think we should really support all existing EXPLAIN
formats, not just text. We need to support the other formats (yaml,
json, xml) if we want to use the EXPLAIN PID approach, and it also
makes the plans easier to process by additional tools.
regards
--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
I did a quick initial review of this patch today, so here are my comments so far:
- ipcs.c should include utils/cmdstatus.h (the compiler complains
about implicit declaration of two functions)
- Attempts to get plan for simple insert queries like this
INSERT INTO x SELECT * FROM x;
end with a segfault, because ActivePortal->queryDesc is 0x0 for this
query. Needs investigation.
- The lockless approach seems fine to me, although I think the fear
of performance issues is a bit moot (I don't think we expect large
number of processes calling pg_cmdstatus at the same time). But
it's not significantly more complex, so why not.
- The patch contains pretty much no documentation, both comments
at the code level and user docs. The lack of user docs is not that
a big deal at this point (although the patch seems to be mature
enough, although the user-level API will likely change).
The lack of code comments is more serious, as it makes the review
somewhat more difficult. For example it'd be very nice to document
the contract for the lock-less interface.
- I agree that pg_cmdstatus() is not the best API. Having something
like EXPLAIN PID would be nice, but it does not really work for
all the request types (just CMD_STATUS_REQUEST_EXPLAIN). Maybe
there's not a single API for all cases, i.e. we should use EXPLAIN
PID for one case and invent something different for the other?
- Is there a particular reason why we allocate slots for auxiliary
processes and not just for backends (NumBackends)? Do we expect those
auxiliary processes to ever use this API?
- CleanupCmdStatusSlot seems needlessly complicated. I don't quite see
the need for the second argument, or the internal slot variable. Why
not to simply use the MyCmdStatusSlot directly?
- I also don't quite understand why we need to track css_pid for the
slot? In what scenario will this actually matter?
- While being able to get EXPLAIN from the running process is nice,
I'm especially interested in getting EXPLAIN ANALYZE to get insight
into the progress of the execution. There are other ways to get the
EXPLAIN, e.g. by opening a different connection and actually running
it (sure, the plan might have changed since then), but currently
there's no way to get insight into the progress.
From the thread I get the impression that Oleksandr also finds this
useful - correct? What are the plans in this direction?
ISTM we need at least two things for that to work:
(a) Ability to enable instrumentation on all queries (effectively
what auto_explain allows), otherwise we can't get EXPLAIN ANALYZE
on the queries later. But auto_explain is an extension, so that
does not seem as a good match if this is supposed to be in core.
In that case a separate GUC seems appropriate.
(b) Being able to run the InstrEnd* methods repeatedly - the initial
message in this thread mentions issues with InstrEndLoop for
example. So perhaps this is non-trivial.
- And finally, I think we should really support all existing EXPLAIN
formats, not just text. We need to support the other formats (yaml,
json, xml) if we want to use the EXPLAIN PID approach, and it also
makes the plans easier to process by additional tools.
I can think of something like:

EXPLAIN [ ( option [, ...] ) ] PROCESS <PID>;

where option is extended with:

QUERY
PROGRESS
BACKTRACE

in addition to the usual ANALYZE, VERBOSE, FORMAT, etc.
On 09/14/2015 10:23 AM, Shulgin, Oleksandr wrote:
On Sat, Sep 12, 2015 at 11:50 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
...

- Attempts to get plan for simple insert queries like this
INSERT INTO x SELECT * FROM x;
end with a segfault, because ActivePortal->queryDesc is 0x0 for this
query. Needs investigation.
Yes, I've hit the same problem after submitting the latest version of
the patch. For now I've just added a check for queryDesc being not
NULL, but I guess the top of the current_query_stack might contains
something useful. Something I need to try.
Well, the thing is we're able to do EXPLAIN on those queries, and IIRC auto_explain can log them too. So perhaps look into the hooks where they take the queryDesc in those cases - it has to be available somewhere.
- The lockless approach seems fine to me, although I think the fear
of performance issues is a bit moot (I don't think we expect large
number of processes calling pg_cmdstatus at the same time). But
it's not significantly more complex, so why not.
I believe the main benefit of the less-locking approach is that if
something goes wrong when two backends tried to communicate it doesn't
prevent the rest of them from operating, because there is no shared (and
thus locked) communication channel.
Sure, but I think it really deserves a bit more detailed explanation of the protocol, and discussion of the expected behavior for some basic failure types.
For example - what happens when a backend requests info, but dies before receiving it, and the backend is reused for another connection? Doesn't this introduce a race condition? Perhaps not, I'm just asking.
- The patch contains pretty much no documentation, both comments
at the code level and user docs. The lack of user docs is not that
a big deal at this point (although the patch seems to be mature
enough, the user-level API will likely change).
The lack of code comments is more serious, as it makes the review
somewhat more difficult. For example it'd be very nice to document
the contract for the lock-less interface.
I will add the code comments. The user docs could wait before we decide
on the interface, I think.
Agreed, although I think having rudimentary user documentation would be useful for the reviewers - a summary of the goals that are a bit scattered over the e-mail thread.
- I agree that pg_cmdstatus() is not the best API. Having something
like EXPLAIN PID would be nice, but it does not really work for
all the request types (just CMD_STATUS_REQUEST_EXPLAIN). Maybe
there's not a single API for all cases, i.e. we should use EXPLAIN
PID for one case and invent something different for the other?
I can think of something like:
EXPLAIN [ ( option [, ...] ) ] PROCESS <PID>;
where option is extended with:
QUERY
PROGRESS
BACKTRACE
in addition to the usual ANALYZE, VERBOSE, FORMAT, etc.
Seems OK. I'm not quite sure what PROGRESS is - I assume it's the same thing as ANALYZE? Why not to use the keyword, then?
- Is there a particular reason why we allocate slots for auxiliary
processes and not just for backends (NumBackends)? Do we expect those
auxiliary processes to ever use this API?
If we extend the interface to a more general one, there still might be
some space for querying status of checkpointer of bgwriter.
I don't think we should mix this with monitoring of auxiliary processes. This interface is aimed at monitoring SQL queries running in other backends, effectively "remote" EXPLAIN. But those auxiliary processes are not processing SQL queries at all, they're not even using regular executor ...
OTOH the ability to request this info (e.g. auxiliary process looking at plans running in backends) seems useful, so I'm ok with tuple slots for auxiliary processes.
- CleanupCmdStatusSlot seems needlessly complicated. I don't quite see
the need for the second argument, or the internal slot variable. Why
not to simply use the MyCmdStatusSlot directly?
Good point.
- I also don't quite understand why we need to track css_pid for the
slot? In what scenario will this actually matter?
I think it's being only used for error reporting and could help in
debugging, but for now that's it.
I recommend getting rid of it, unless we have a clear idea of how to use it. Otherwise I envision we won't really keep it updated properly (because it's "just for debugging"), and then one day someone actually starts using it and gets bitten by it.
- While being able to get EXPLAIN from the running process is nice,
I'm especially interested in getting EXPLAIN ANALYZE to get insight
into the progress of the execution. There are other ways to get the
EXPLAIN, e.g. by opening a different connection and actually running
it (sure, the plan might have changed since then), but currently
there's no way to get insight into the progress.
From the thread I get the impression that Oleksandr also finds this
useful - correct? What are the plans in this direction?
ISTM we need at least two things for that to work:
(a) Ability to enable instrumentation on all queries (effectively
what auto_explain allows), otherwise we can't get EXPLAIN ANALYZE
on the queries later. But auto_explain is an extension, so that
does not seem like a good match if this is supposed to be in core.
In that case a separate GUC seems appropriate.
(b) Being able to run the InstrEnd* methods repeatedly - the initial
message in this thread mentions issues with InstrEndLoop for
example. So perhaps this is non-trivial.
I was able to make this work with a simple change to InstrEndLoop and
the callers. Basically, adding a bool parameter in_explain and passing
an appropriate value. I guess that's not the best approach, but it
appears to work.
Adding a GUC to enable instrumentation sounds reasonable.
Nice!
Do you believe it makes sense to add instrumentation support in this
same patch or better focus on making the simplest thing work first?
Let's make it a "patch series" with two patches.
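To illustrate the shape of the InstrEndLoop change being discussed above - a rough sketch only, not the actual patch; the accumulation body is elided, and whether skipping just the reset is really sufficient is part of what needs review:

void
InstrEndLoop(Instrumentation *instr, bool in_explain)
{
    /* Skip if nothing has happened, or already shut down */
    if (!instr->running)
        return;

    /* ... accumulate startup, total, ntuples, nloops as today ... */

    /*
     * When called from the in-progress EXPLAIN path, leave the per-loop
     * state alone: the executor is still running this node and would
     * otherwise lose its counters.
     */
    if (in_explain)
        return;

    /* Reset for next cycle (if any) */
    instr->running = false;
    INSTR_TIME_SET_ZERO(instr->starttime);
    INSTR_TIME_SET_ZERO(instr->counter);
    instr->firsttuple = 0;
    instr->tuplecount = 0;
}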
- And finally, I think we should really support all existing EXPLAIN
formats, not just text. We need to support the other formats (yaml,
json, xml) if we want to use the EXPLAIN PID approach, and it also
makes the plans easier to process by additional tools.
Sure, that was in my plans (and see above for possible syntax). What
would be really neat is retrieving the complete backtrace. Not sure
what the good interface would look like, but using JSON format for the
output sounds promising.
I don't quite see the reason to encode everything as JSON, because that makes it a bit difficult for clients that would prefer YAML, for example. Why not to just use the requested format? For example in YAML, we can do something like this:
QUERY PLAN
----------------------------------
- Plan[0]: +
Node Type: "Index Only Scan"+
Scan Direction: "Forward" +
... +
+
- Plan[1]: +
Node Type: "Index Only Scan"+
Scan Direction: "Forward" +
... +
+
- Plan[2]: +
Node Type: "Index Only Scan"+
Scan Direction: "Forward" +
... +
and similarly for other formats. We don't really use nesting.
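As a side note, the responding backend should be able to reuse the existing explain.c machinery to honour whatever format was requested - roughly like the sketch below, which is not part of the patch; requested_format is assumed to arrive with the request:

#include "commands/explain.h"

static char *
explain_in_requested_format(QueryDesc *queryDesc, ExplainFormat requested_format)
{
    ExplainState *es = NewExplainState();

    es->format = requested_format;  /* EXPLAIN_FORMAT_TEXT/XML/JSON/YAML */
    es->analyze = false;            /* plain plan, no instrumentation */

    ExplainBeginOutput(es);
    ExplainPrintPlan(es, queryDesc);
    ExplainEndOutput(es);

    return es->str->data;           /* the plan text to send back */
}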
On 09/14/2015 01:15 PM, Shulgin, Oleksandr wrote: > On Mon, Sep 14, 2015 at 11:53 AM, Tomas Vondra > <tomas.vondra@2ndquadrant.com <mailto:tomas.vondra@2ndquadrant.com>> wrote: > > > On 09/14/2015 10:23 AM, Shulgin, Oleksandr wrote: > > On Sat, Sep 12, 2015 at 11:50 AM, Tomas Vondra > <tomas.vondra@2ndquadrant.com > <mailto:tomas.vondra@2ndquadrant.com> > <mailto:tomas.vondra@2ndquadrant.com > <mailto:tomas.vondra@2ndquadrant.com>>> wrote: > > ... > > - Attempts to get plan for simple insert queries like this > > INSERT INTO x SELECT * FROM x; > > end with a segfault, because ActivePortal->queryDesc is > 0x0 for this > query. Needs investigation. > > > Yes, I've hit the same problem after submitting the latest > version of > the patch. For now I've just added a check for queryDesc being not > NULL, but I guess the top of the current_query_stack might contains > something useful. Something I need to try. > > > Well, the thing is we're able to do EXPLAIN on those queries, and > IIRC auto_explain can log them too. So perhaps look into the hooks > where they take the queryDesc in those cases - it has to be > available somewhere. > > > Yes, this seems to work. I was able to successfully query "create table > as" and even "explain analyze" for the plans. In both cases > ActivePortal doesn't have a valid queryDesc, but the executor hook > captures one. > > - The lockless approach seems fine to me, although I think > the fear > of performance issues is a bit moot (I don't think we > expect large > number of processes calling pg_cmdstatus at the same > time). But > it's not significantly more complex, so why not. > > > I believe the main benefit of the less-locking approach is that if > something goes wrong when two backends tried to communicate it > doesn't > prevent the rest of them from operating, because there is no > shared (and > thus locked) communication channel. > > > Sure, but I think it really deserves a bit more detailed explanation > of the protocol, and discussion of the expected behavior for some > basic failure types. > > For example - what happens when a backend requests info, but dies > before receiving it, and the backed is reused for another > connection? Doesn't this introduce a race condition? Perhaps not, > I'm just asking. > > > Now explained better in code comments. > > The worst thing that could happen in this case I believe is that the > requesting backend will not receive any response from the second use of > its slot due to the slot being marked as processed by the backend that > was asked first: > > A: > slot->is_processed = false; > slot->is_valid = true; > /* signals backend B */; > shm_mq_read(); /* blocks */ > > B: /* finds the slot and prepares the response */ > > A: /* decides to bail out */ > InvalidateCmdStatusSlot(); > shm_mq_detach(); > /* exits pg_cmdstatus() */ > > /* pg_cmdstatus() is called again */ > /* initializes the slot and signals some backend */ > shm_mq_read(); /* blocks */ > > B: /* completes preparing the response */ > slot->is_processed = true; > shm_mq_send() => SHM_MQ_DETACHED > slot->is_valid = false; > /* gives up with this slot */ > > Now the backend that has been signaled on the second call to > pg_cmdstatus (it can be either some other backend, or the backend B > again) will not find an unprocessed slot, thus it will not try to > attach/detach the queue and the backend A will block forever. > > This requires a really bad timing and the user should be able to > interrupt the querying backend A still. 
I think we can't rely on the low probability that this won't happen, and we should not rely on people interrupting the backend. Being able to detect the situation and fail gracefully should be possible. It may be possible to introduce some lock-less protocol preventing such situations, but it's not there at the moment. If you believe it's possible, you need to explain and "prove" that it's actually safe. Otherwise we may need to introduce some basic locking - for example we may introduce a LWLock for each slot, and lock it with dontWait=true (and skip it if we couldn't lock it). This should prevent most scenarios where one corrupted slot blocks many processes. > > In any case, the backends that are being asked to send the info will be > able to notice the problem (receiver detached early) and handle it > gracefully. Ummm, how? Maybe I missed something? > I don't think we should mix this with monitoring of auxiliary > processes. This interface is designed at monitoring SQL queries > running in other backends, effectively "remote" EXPLAIN. But those > auxiliary processes are not processing SQL queries at all, they're > not even using regular executor ... > > OTOH the ability to request this info (e.g. auxiliary process > looking at plans running in backends) seems useful, so I'm ok with > tuple slots for auxiliary processes. > > > Now that I think about it, reserving the slots for aux process doesn't > let us query their status, it's the other way round. If we don't > reserve them, then aux process would not be able to query any other > process for the status. Likely this is not a problem at all, so we can > remove these extra slots. I don't know. I can imagine using this from background workers, but I think those are counted as regular backends (not sure though). > > - I also don't quite understand why we need to track > css_pid for the > slot? In what scenario will this actually matter? > > > I think it's being only used for error reporting and could help in > debugging, but for now that's it. > > > I recommend getting rid of it, unless we have a clear idea of how to > use it. Otherwise I envision we won't really keep it updated > properly (because it's "just for debugging"), and then one day > someone actually starts using it and get bitten by it. > > > The design was copied over from procsignal.c, where the pss_pid is used > to "claim" the procsignal slot. There it's used for searching the > target process slot when backend id is not known. We don't use that in > cmd status slots because of the inverted slots logic, so we could really > remove this field. OK. > I don't quite see the reason to encode everything as JSON, because > that makes it a bit difficult for clients that would prefer YAML, > for example. Why not to just use the requested format? For example > in YAML, we can do something like this: > > QUERY PLAN > ---------------------------------- > - Plan[0]: + > Node Type: "Index Only Scan"+ > Scan Direction: "Forward" + > ... + > + > - Plan[1]: + > Node Type: "Index Only Scan"+ > Scan Direction: "Forward" + > ... + > + > - Plan[2]: + > Node Type: "Index Only Scan"+ > Scan Direction: "Forward" + > ... + > > and similarly for other formats. We don't really use nesting. > > > You're right, the nesting structure is too simple to require a certain > format. Do you have ideas about representing stack frames in TEXT > format? Something like what gdb does with #0, #1, etc. markers for > frame depth? Yes. The simpler the format, the better. 
regards -- Tomas Vondra http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 09/14/2015 01:15 PM, Shulgin, Oleksandr wrote:On Mon, Sep 14, 2015 at 11:53 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com <mailto:tomas.vondra@2ndquadrant.com>> wrote:
On 09/14/2015 10:23 AM, Shulgin, Oleksandr wrote:
On Sat, Sep 12, 2015 at 11:50 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com
<mailto:tomas.vondra@2ndquadrant.com>
<mailto:tomas.vondra@2ndquadrant.com
<mailto:tomas.vondra@2ndquadrant.com>>> wrote:
...
- Attempts to get plan for simple insert queries like this
INSERT INTO x SELECT * FROM x;
end with a segfault, because ActivePortal->queryDesc is
0x0 for this
query. Needs investigation.
Yes, I've hit the same problem after submitting the latest
version of
the patch. For now I've just added a check for queryDesc being not
NULL, but I guess the top of the current_query_stack might contain
something useful. Something I need to try.
Well, the thing is we're able to do EXPLAIN on those queries, and
IIRC auto_explain can log them too. So perhaps look into the hooks
where they take the queryDesc in those cases - it has to be
available somewhere.
Yes, this seems to work. I was able to successfully query "create table
as" and even "explain analyze" for the plans. In both cases
ActivePortal doesn't have a valid queryDesc, but the executor hook
captures one.
- The lockless approach seems fine to me, although I think
the fear
of performance issues is a bit moot (I don't think we
expect large
number of processes calling pg_cmdstatus at the same
time). But
it's not significantly more complex, so why not.
I believe the main benefit of the less-locking approach is that if
something goes wrong when two backends tried to communicate it
doesn't
prevent the rest of them from operating, because there is no
shared (and
thus locked) communication channel.
Sure, but I think it really deserves a bit more detailed explanation
of the protocol, and discussion of the expected behavior for some
basic failure types.
For example - what happens when a backend requests info, but dies
before receiving it, and the backend is reused for another
connection? Doesn't this introduce a race condition? Perhaps not,
I'm just asking.
Now explained better in code comments.
The worst thing that could happen in this case I believe is that the
requesting backend will not receive any response from the second use of
its slot due to the slot being marked as processed by the backend that
was asked first:
A:
slot->is_processed = false;
slot->is_valid = true;
/* signals backend B */;
shm_mq_read(); /* blocks */
B: /* finds the slot and prepares the response */
A: /* decides to bail out */
InvalidateCmdStatusSlot();
shm_mq_detach();
/* exits pg_cmdstatus() */
/* pg_cmdstatus() is called again */
/* initializes the slot and signals some backend */
shm_mq_read(); /* blocks */
B: /* completes preparing the response */
slot->is_processed = true;
shm_mq_send() => SHM_MQ_DETACHED
slot->is_valid = false;
/* gives up with this slot */
Now the backend that has been signaled on the second call to
pg_cmdstatus (it can be either some other backend, or the backend B
again) will not find an unprocessed slot, thus it will not try to
attach/detach the queue and the backend A will block forever.
This requires a really bad timing and the user should be able to
interrupt the querying backend A still.
I think we can't rely on the low probability that this won't happen, and we should not rely on people interrupting the backend. Being able to detect the situation and fail gracefully should be possible.
It may be possible to introduce some lock-less protocol preventing such situations, but it's not there at the moment. If you believe it's possible, you need to explain and "prove" that it's actually safe.
Otherwise we may need to introduce some basic locking - for example we may introduce a LWLock for each slot, and lock it with dontWait=true (and skip it if we couldn't lock it). This should prevent most scenarios where one corrupted slot blocks many processes.
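(In core terms the dontWait idea would presumably map onto LWLockConditionalAcquire(); a sketch only, with the per-slot lock field being an invented addition to the slot struct:)

/* hypothetical: each slot carries its own LWLock */
if (!LWLockConditionalAcquire(slot->lock, LW_EXCLUSIVE))
{
    /*
     * Someone - possibly a stuck or crashed requester - holds this slot;
     * skip it rather than block, so one bad slot cannot stall everyone.
     */
    return;
}

/* ... read or fill in the slot ... */

LWLockRelease(slot->lock);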
In any case, the backends that are being asked to send the info will be
able to notice the problem (receiver detached early) and handle it
gracefully.
Ummm, how? Maybe I missed something?
I don't think we should mix this with monitoring of auxiliary
processes. This interface is aimed at monitoring SQL queries
running in other backends, effectively "remote" EXPLAIN. But those
auxiliary processes are not processing SQL queries at all, they're
not even using regular executor ...
OTOH the ability to request this info (e.g. auxiliary process
looking at plans running in backends) seems useful, so I'm ok with
tuple slots for auxiliary processes.
Now that I think about it, reserving the slots for aux process doesn't
let us query their status, it's the other way round. If we don't
reserve them, then aux process would not be able to query any other
process for the status. Likely this is not a problem at all, so we can
remove these extra slots.
I don't know. I can imagine using this from background workers, but I think those are counted as regular backends (not sure though).
Well, I didn't attach the updated patch (doing that now).
Attachment
On Mon, Sep 14, 2015 at 2:11 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
Now the backend that has been signaled on the second call to
pg_cmdstatus (it can be either some other backend, or the backend B
again) will not find an unprocessed slot, thus it will not try to
attach/detach the queue and the backend A will block forever.
This requires a really bad timing and the user should be able to
interrupt the querying backend A still.
I think we can't rely on the low probability that this won't happen, and we should not rely on people interrupting the backend. Being able to detect the situation and fail gracefully should be possible.
It may be possible to introduce some lock-less protocol preventing such situations, but it's not there at the moment. If you believe it's possible, you need to explain and "prove" that it's actually safe.
Otherwise we may need to introduce some basic locking - for example we may introduce a LWLock for each slot, and lock it with dontWait=true (and skip it if we couldn't lock it). This should prevent most scenarios where one corrupted slot blocks many processes.OK, I will revisit this part then.
On Mon, Sep 14, 2015 at 3:09 PM, Shulgin, Oleksandr <oleksandr.shulgin@zalando.de> wrote:On Mon, Sep 14, 2015 at 2:11 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
Now the backend that has been signaled on the second call to
pg_cmdstatus (it can be either some other backend, or the backend B
again) will not find an unprocessed slot, thus it will not try to
attach/detach the queue and the backend A will block forever.
This requires a really bad timing and the user should be able to
interrupt the querying backend A still.
I think we can't rely on the low probability that this won't happen, and we should not rely on people interrupting the backend. Being able to detect the situation and fail gracefully should be possible.
It may be possible to introduce some lock-less protocol preventing such situations, but it's not there at the moment. If you believe it's possible, you need to explain and "prove" that it's actually safe.
Otherwise we may need to introduce some basic locking - for example we may introduce a LWLock for each slot, and lock it with dontWait=true (and skip it if we couldn't lock it). This should prevent most scenarios where one corrupted slot blocks many processes.

OK, I will revisit this part then.

I have a radical proposal to remove the need for locking: make the CmdStatusSlot struct consist of a mere dsm_handle and move all the required metadata like sender_pid, request_type, etc. into the shared memory segment itself.

If we allow only the requesting process to update the slot (that is, the handle value itself) this removes the need for locking between sender and receiver.

The sender will walk through the slots looking for a non-zero dsm handle (according to the dsm_create() implementation 0 is considered an invalid handle), and if it finds a valid one, it will attach and look inside, to check if it's destined for this process ID. At first that might sound strange, but I would expect that 99% of the time the only valid slot would be for the process that has just been signaled.

The sender process will then calculate the response message, update the result_code in the shared memory segment and finally send the message through the queue. If the receiver has since detached we get a detached result code and bail out.

Clearing the slot after receiving the message should be the requesting process' responsibility. This way the receiver only writes to the slot and the sender only reads from it.

By the way, is it safe to assume atomic reads/writes of dsm_handle (uint32)? I would be surprised if not.
--Alex
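Spelled out in code, the proposal above looks roughly like the following sketch; all names (CmdStatusSlot, MyCmdStatusSlot, PROCSIG_CMD_STATUS_INFO and friends) are made up here, and the shm_mq layout inside the segment is elided:

typedef struct
{
    dsm_handle  handle;     /* 0 = slot not in use; never a valid handle */
} CmdStatusSlot;            /* one entry per backend, in shared memory */

/* Requesting backend A - the only writer of its own slot */
dsm_segment *seg = dsm_create(request_size, 0);

/* lay out sender_pid, request_type and the shm_mq inside the segment here */
MyCmdStatusSlot->handle = dsm_segment_handle(seg);
SendProcSignal(target_pid, PROCSIG_CMD_STATUS_INFO, target_backend_id);
/* ... shm_mq_receive() the reply ... */
MyCmdStatusSlot->handle = 0;
dsm_detach(seg);

/* Signalled backend B - only ever reads the other backends' slots */
int     i;

for (i = 0; i < NumCmdStatusSlots; i++)
{
    dsm_handle  h = CmdStatusSlots[i].handle;

    if (h == 0)
        continue;
    /* dsm_attach(h), check whether the request inside is addressed to
       MyProcPid, and if so write the answer into the segment's shm_mq */
}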
2015-09-14 18:46 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>:I have a radical proposal to remove the need for locking: make the CmdStatusSlot struct consist of a mere dsm_handle and move all the required metadata like sender_pid, request_type, etc. into the shared memory segment itself.If we allow the only the requesting process to update the slot (that is the handle value itself) this removes the need for locking between sender and receiver.The sender will walk through the slots looking for a non-zero dsm handle (according to dsm_create() implementation 0 is considered an invalid handle), and if it finds a valid one, it will attach and look inside, to check if it's destined for this process ID. At first that might sound strange, but I would expect 99% of the time that the only valid slot would be for the process that has been just signaled.The sender process will then calculate the response message, update the result_code in the shared memory segment and finally send the message through the queue. If the receiver has since detached we get a detached result code and bail out.Clearing the slot after receiving the message should be the requesting process' responsibility. This way the receiver only writes to the slot and the sender only reads from it.By the way, is it safe to assume atomic read/writes of dsm_handle (uint32)? I would be surprised if not.I don't see any reason why it should not to work - only few processes will wait for data - so lost attach/detach shm operations will not be too much.
Attachment
On Mon, Sep 14, 2015 at 7:27 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:2015-09-14 18:46 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>:... This way the receiver only writes to the slot and the sender only reads from it.By the way, is it safe to assume atomic read/writes of dsm_handle (uint32)? I would be surprised if not.I don't see any reason why it should not to work - only few processes will wait for data - so lost attach/detach shm operations will not be too much.Please see attached for implementation of this approach. The most surprising thing is that it actually works :)One problem still remains with the process requesting information when the target process exits before it can have a chance to handle the signal. The requesting process then waits forever, because nobody attaches/detaches the queue. We've discussed this before and looks like we need to introduce a timeout parameter to pg_cmdstatus()...
Attachment
On Tue, Sep 15, 2015 at 11:00 AM, Shulgin, Oleksandr <oleksandr.shulgin@zalando.de> wrote:On Mon, Sep 14, 2015 at 7:27 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:2015-09-14 18:46 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>:... This way the receiver only writes to the slot and the sender only reads from it.By the way, is it safe to assume atomic read/writes of dsm_handle (uint32)? I would be surprised if not.I don't see any reason why it should not to work - only few processes will wait for data - so lost attach/detach shm operations will not be too much.Please see attached for implementation of this approach. The most surprising thing is that it actually works :)One problem still remains with the process requesting information when the target process exits before it can have a chance to handle the signal. The requesting process then waits forever, because nobody attaches/detaches the queue. We've discussed this before and looks like we need to introduce a timeout parameter to pg_cmdstatus()...I've added the timeout parameter to the pg_cmdstatus call, and more importantly to the sending side of the queue, so that one can limit the potential effect of handling the interrupt in case something goes really wrong.
I've tested a number of possible scenarios with artificial delays in reply and sending cancel request or SIGTERM to the receiving side, and this all seems to work nicely due to proper message queue detach event notification. Still I think it helps to know that there is a timeout in case the receiving side is really slow to read the message.

I've also decided we really ought to suppress any possible ERROR level messages generated during the course of processing the status request in order not to prevent the originally running transaction from completing. The errors so caught are just logged using LOG level and ignored in this new version of the patch.

I'm also submitting the instrumentation support as a separate patch on top of this. I'm not really fond of adding a bool parameter to InstrEndLoop, but for now I didn't find any better way.

What I'm now thinking about is that we can probably extend this backend communication mechanism in order to query GUC values effective in a different backend, or even set them. The obvious candidate to check when you see some query misbehaving would be work_mem, for example. Or we could provide a report of all settings that were overridden in the backend's session, to the effect of running something like this:

select * from pg_settings where context = 'user' and setting != reset_val;
The obvious candidates to be set externally are the cmdstatus_analyze/instrument_*: when you decided you want to turn them on, you'd rather do that carefully for a single backend than globally per-cluster. One can still modify the postgresql.conf and then send SIGHUP to just a single backend, but I think a more direct way to alter the settings would be better.
In this light should we rename the API to something like "backend control" instead of "command status"? The SHOW/SET syntax could be extended to support the remote setting retrieval/update.
--Alex
2015-09-16 16:31 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>:I've added the timeout parameter to the pg_cmdstatus call, and more importantly to the sending side of the queue, so that one can limit the potential effect of handling the interrupt in case something goes really wrong.I don't think so introduction new user visible timer is good idea. This timer should be invisibleMy idea - send a signal and wait 1sec, then check if target process is living still. Stop if not. Wait next 5 sec, then check target process. Stop if this process doesn't live or raise warning "requested process doesn't communicate, waiting.." The final cancel should be done by statement_timeout.what do you think about it?
What I'm now thinking about is probably we can extend this backend communication mechanism in order to query GUC values effective in a different backend or even setting the values. The obvious candidate to check when you see some query misbehaving would be work_mem, for example. Or we could provide a report of all settings that were overridden in the backend's session, to the effect of running something like this:select * from pg_settings where context = 'user' and setting != reset_val;this is a good idea. More times I interested what is current setting of query. We cannot to use simple query - because it is not possible to push PID to function simply, but you can to write function pg_settings_pid() so usage can look likeselect * from pg_settings_pid(xxxx, possible other params) where ...
The obvious candidates to be set externally are the cmdstatus_analyze/instrument_*: when you decided you want to turn them on, you'd rather do that carefully for a single backend than globally per-cluster. One can still modify the postgresql.conf and then send SIGHUP to just a single backend, but I think a more direct way to alter the settings would be better.I am 100% for possibility to read the setting. But reconfiguration from other process is too hot coffee - it can be available via extension, but not via usually used tools.
In this light should we rename the API to something like "backend control" instead of "command status"? The SHOW/SET syntax could be extended to support the remote setting retrieval/update.prepare API, and this functionality can be part of referential implementation in contrib.This patch should not to have too range be finished in this release cycle.
On Wed, Sep 16, 2015 at 8:07 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:2015-09-16 16:31 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>:I've added the timeout parameter to the pg_cmdstatus call, and more importantly to the sending side of the queue, so that one can limit the potential effect of handling the interrupt in case something goes really wrong.I don't think so introduction new user visible timer is good idea. This timer should be invisibleMy idea - send a signal and wait 1sec, then check if target process is living still. Stop if not. Wait next 5 sec, then check target process. Stop if this process doesn't live or raise warning "requested process doesn't communicate, waiting.." The final cancel should be done by statement_timeout.what do you think about it?That won't work really well with something like I use to do when testing this patch, namely:postgres=# select pid, array(select pg_cmdstatus(pid, 1, 10)) from pg_stat_activity where pid<>pg_backend_pid() \watch 1while also running pgbench with -C option (to create new connection for every transaction). When a targeted backend exits before it can handle the signal, the receiving process keeps waiting forever.
The statement_timeout in this case will stop the whole select, not just individual function call. Unless you wrap it to set statement_timeout and catch QUERY_CANCELED in plpgsql, but then you won't be able to ^C the whole select. The ability to set a (short) timeout for the function itself proves to be a really useful feature to me.
We can still communicate some warnings to the client when no timeout is specified (and make 0 the default value actually).What I'm now thinking about is probably we can extend this backend communication mechanism in order to query GUC values effective in a different backend or even setting the values. The obvious candidate to check when you see some query misbehaving would be work_mem, for example. Or we could provide a report of all settings that were overridden in the backend's session, to the effect of running something like this:select * from pg_settings where context = 'user' and setting != reset_val;this is a good idea. More times I interested what is current setting of query. We cannot to use simple query - because it is not possible to push PID to function simply, but you can to write function pg_settings_pid() so usage can look likeselect * from pg_settings_pid(xxxx, possible other params) where ...I would rather have a more general way to run *readonly* queries in the other backend than invent a new function for every occurrence.The obvious candidates to be set externally are the cmdstatus_analyze/instrument_*: when you decided you want to turn them on, you'd rather do that carefully for a single backend than globally per-cluster. One can still modify the postgresql.conf and then send SIGHUP to just a single backend, but I think a more direct way to alter the settings would be better.I am 100% for possibility to read the setting. But reconfiguration from other process is too hot coffee - it can be available via extension, but not via usually used tools.It can be reserved to superuser, and nobody is forcing one to use it anyway. :-)In this light should we rename the API to something like "backend control" instead of "command status"? The SHOW/SET syntax could be extended to support the remote setting retrieval/update.prepare API, and this functionality can be part of referential implementation in contrib.This patch should not to have too range be finished in this release cycle.These are just the thoughts on what could be achieved using this cross-backend communication mechanism and ideas for generalization of the API.
--Alex
On Tue, Sep 15, 2015 at 5:00 AM, Shulgin, Oleksandr <oleksandr.shulgin@zalando.de> wrote: > Please see attached for implementation of this approach. The most > surprising thing is that it actually works :) It's cool to see these interesting ideas for using some of the code I've been working on for the last couple of years. However, it seems to me that a better design would avoid the use of shm_mq. Instead of having the guy who asks the question create a DSM, have the guy who sends back the answer create it. Then, he can just make the DSM big enough to store the entire string. I think that's going to be a lot more robust than what you have here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Sep 16, 2015 at 10:31 AM, Shulgin, Oleksandr <oleksandr.shulgin@zalando.de> wrote: > I've also decided we really ought to suppress any possible ERROR level > messages generated during the course of processing the status request in > order not to prevent the originally running transaction to complete. The > errors so caught are just logged using LOG level and ignored in this new > version of the patch. This plan has almost zero chance of surviving committer-level scrutiny, and if it somehow does, some other committer will scream bloody murder as soon as they notice it. It's not safe to just catch an error and continue on executing. You have to abort the (sub)transaction once an error happens. Of course, this gets at a pretty crucial design question for this patch, which is whether it's OK for one process to interrogate another process's state, and what the consequences of that might be. What permissions are needed? Can you DOS the guy running a query by pinging him for status at a very high rate? The other question here is whether it's really safe to do this. ProcessInterrupts() gets called at lots of places in the code, and we may freely add more in the future. How are you going to prove that ProcessCmdStatusInfoRequest() is safe to call in all of those places? How do you know that some data structure you find by chasing down from ActivePortal or current_query_stack doesn't have a NULL pointer someplace that you don't expect, because the code that initializes that pointer hasn't run yet? I'd like to see this made to work and be safe, but I think proving that it's truly robust in all cases is going to be hard. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
That won't work really well with something like I use to do when testing this patch, namely:postgres=# select pid, array(select pg_cmdstatus(pid, 1, 10)) from pg_stat_activity where pid<>pg_backend_pid() \watch 1while also running pgbench with -C option (to create new connection for every transaction). When a targeted backend exits before it can handle the signal, the receiving process keeps waiting forever.no - every timeout you have to check, if targeted backend is living still, if not you will do cancel. It is 100% safe.
The statement_timeout in this case will stop the whole select, not just individual function call. Unless you wrap it to set statement_timeout and catch QUERY_CANCELED in plpgsql, but then you won't be able to ^C the whole select. The ability to set a (short) timeout for the function itself proves to be a really useful feature to me.you cannot to handle QUERY_CANCELED in plpgsql.
There is need some internal timeout - but this timeout should not be visible - any new GUC increase PostgreSQL complexity - and there is not a reason why do it.
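For illustration, the waiting side Pavel describes could look something like the sketch below - assuming a short internal timeout and some way to test whether the reply has arrived; response_is_ready() and target_pid are invented names, and the exact WaitLatch usage is only indicative:

for (;;)
{
    int     rc;

    rc = WaitLatch(MyLatch,
                   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                   100L);           /* short, invisible timeout in ms */

    if (rc & WL_POSTMASTER_DEATH)
        proc_exit(1);

    ResetLatch(MyLatch);
    CHECK_FOR_INTERRUPTS();         /* lets statement_timeout and ^C work */

    if (response_is_ready())        /* invented predicate */
        break;

    if (kill(target_pid, 0) != 0)   /* target backend has gone away */
        ereport(ERROR,
                (errmsg("backend with PID %d exited before answering",
                        (int) target_pid)));
}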
On Tue, Sep 15, 2015 at 5:00 AM, Shulgin, Oleksandr
<oleksandr.shulgin@zalando.de> wrote:
> Please see attached for implementation of this approach. The most
> surprising thing is that it actually works :)
It's cool to see these interesting ideas for using some of the code
I've been working on for the last couple of years. However, it seems
to me that a better design would avoid the use of shm_mq. Instead of
having the guy who asks the question create a DSM, have the guy who
sends back the answer create it. Then, he can just make the DSM big
enough to store the entire string. I think that's going to be a lot
more robust than what you have here.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Sep 17, 2015 at 12:06 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:

That won't work really well with something like I use to do when testing this patch, namely:

postgres=# select pid, array(select pg_cmdstatus(pid, 1, 10)) from pg_stat_activity where pid<>pg_backend_pid() \watch 1

while also running pgbench with -C option (to create new connection for every transaction). When a targeted backend exits before it can handle the signal, the receiving process keeps waiting forever.

no - every timeout you have to check, if targeted backend is living still, if not you will do cancel. It is 100% safe.

But then you need to make this internal timeout rather short, not 1s as originally suggested.

The statement_timeout in this case will stop the whole select, not just the individual function call. Unless you wrap it to set statement_timeout and catch QUERY_CANCELED in plpgsql, but then you won't be able to ^C the whole select. The ability to set a (short) timeout for the function itself proves to be a really useful feature to me.

you cannot to handle QUERY_CANCELED in plpgsql.

Well, you can, but it's not that useful of course:

=# create or replace function test_query_cancel() returns void language plpgsql as $$ begin
perform pg_sleep(1);
exception when query_canceled then raise notice 'cancel';
end; $$;
CREATE FUNCTION
=# set statement_timeout to '100ms';
SET
=# select test_query_cancel();
NOTICE: cancel
 test_query_cancel
-------------------

(1 row)

=# select test_query_cancel() from generate_series(1, 100) x;
NOTICE: cancel
^CCancel request sent
NOTICE: cancel
^CCancel request sent

Now you cannot cancel this query unless you do pg_terminate_backend() or equivalent.

There is need some internal timeout - but this timeout should not be visible - any new GUC increase PostgreSQL complexity - and there is not a reason why do it.

But the GUC was added for the timeout on the sending side, not the receiving one. There is no "one value fits all" for this, but you would still want to make the effects of this as limited as possible.
--Alex
On Thu, Sep 17, 2015 at 10:28 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote: > Please, can you explain what is wrong on using of shm_mq? It works pretty > well now. > > There can be a contra argument, why don't use shm_mq, because it does data > exchange between processes and this patch introduce new pattern for same > thing. Sure, I can explain that. First, when you communication through a shm_mq, the writer has to wait when the queue is full, and the reader has to wait when the queue is empty. If the message is short enough to fit in the queue, then you can send it and be done without blocking. But if it is long, then you will have each process repeatedly waiting for the other one until the whole message has been sent. That is going to make sending the message potentially a lot slower than writing it all in one go, especially if the system is under heavy load. Also, if there are any bugs in the way the shm_mq is being used, they're likely to be quite rare and hard to find, because the vast majority of messages will probably be short enough to be sent in a single chunk, so whatever bugs may exist when the processes play ping-ping are unlikely to occur in practice except in unusual cases where the message being returned is very long. Second, using a shm_mq manipulates the state of the process latch. I don't think you can make the assumption that it's safe to reset the process latch at any and every place where we check for interrupts. For example, suppose the process is already using a shm_mq and the CHECK_FOR_INTERRUPTS() call inside that code then discovers that somebody has activated this mechanism and you now go try to send and receive from a new shm_mq. But even if that and every other CHECK_FOR_INTERRUPTS() in the code can tolerate a process latch reset today, it's a new coding rule that could easily trip people up in the future. Using a shm_mq is appropriate when the amount of data that needs to be transmitted might be very large - too large to just allocate a buffer for the whole thing - or when the amount of data can't be predicted before memory is allocated. But there is obviously no rule that a shm_mq should be used any time we have "data exchange between processes"; we have lots of shared-memory based IPC already and have for many years, and shm_mq is newer than the vast majority of that code. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Sep 17, 2015 at 10:28 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> Please, can you explain what is wrong on using of shm_mq? It works pretty
> well now.
>
> There can be a contra argument, why don't use shm_mq, because it does data
> exchange between processes and this patch introduce new pattern for same
> thing.
Sure, I can explain that.
First, when you communicate through a shm_mq, the writer has to wait
when the queue is full, and the reader has to wait when the queue is
empty. If the message is short enough to fit in the queue, then you
can send it and be done without blocking. But if it is long, then you
will have each process repeatedly waiting for the other one until the
whole message has been sent. That is going to make sending the
message potentially a lot slower than writing it all in one go,
especially if the system is under heavy load.
Also, if there are any bugs in the way the shm_mq is being used,
they're likely to be quite rare and hard to find, because the vast
majority of messages will probably be short enough to be sent in a
single chunk, so whatever bugs may exist when the processes play
ping-pong are unlikely to occur in practice except in unusual cases
where the message being returned is very long.
Second, using a shm_mq manipulates the state of the process latch. I
don't think you can make the assumption that it's safe to reset the
process latch at any and every place where we check for interrupts.
For example, suppose the process is already using a shm_mq and the
CHECK_FOR_INTERRUPTS() call inside that code then discovers that
somebody has activated this mechanism and you now go try to send and
receive from a new shm_mq. But even if that and every other
CHECK_FOR_INTERRUPTS() in the code can tolerate a process latch reset
today, it's a new coding rule that could easily trip people up in the
future.
Using a shm_mq is appropriate when the amount of data that needs to be
transmitted might be very large - too large to just allocate a buffer
for the whole thing - or when the amount of data can't be predicted
before memory is allocated. But there is obviously no rule that a
shm_mq should be used any time we have "data exchange between
processes"; we have lots of shared-memory based IPC already and have
for many years, and shm_mq is newer than the vast majority of that
code.
On Wed, Sep 16, 2015 at 10:31 AM, Shulgin, Oleksandr
<oleksandr.shulgin@zalando.de> wrote:
> I've also decided we really ought to suppress any possible ERROR level
> messages generated during the course of processing the status request in
> order not to prevent the originally running transaction to complete. The
> errors so caught are just logged using LOG level and ignored in this new
> version of the patch.
This plan has almost zero chance of surviving committer-level
scrutiny, and if it somehow does, some other committer will scream
bloody murder as soon as they notice it.
It's not safe to just catch an error and continue on executing. You
have to abort the (sub)transaction once an error happens.
Of course, this gets at a pretty crucial design question for this
patch, which is whether it's OK for one process to interrogate another
process's state, and what the consequences of that might be. What
permissions are needed?
Can you DOS the guy running a query by
pinging him for status at a very high rate?
The other question here is whether it's really safe to do this.
ProcessInterrupts() gets called at lots of places in the code, and we
may freely add more in the future. How are you going to prove that
ProcessCmdStatusInfoRequest() is safe to call in all of those places?
How do you know that some data structure you find by chasing down from
ActivePortal or current_query_stack doesn't have a NULL pointer
someplace that you don't expect, because the code that initializes
that pointer hasn't run yet?
I'd like to see this made to work and be safe, but I think proving
that it's truly robust in all cases is going to be hard.
2015-09-17 16:46 GMT+02:00 Robert Haas <robertmhaas@gmail.com>:Second, using a shm_mq manipulates the state of the process latch. I
don't think you can make the assumption that it's safe to reset the
process latch at any and every place where we check for interrupts.
For example, suppose the process is already using a shm_mq and the
CHECK_FOR_INTERRUPTS() call inside that code then discovers that
somebody has activated this mechanism and you now go try to send and
receive from a new shm_mq. But even if that and every other
CHECK_FOR_INTERRUPTS() in the code can tolerate a process latch reset
today, it's a new coding rule that could easily trip people up in the
future.It is valid, and probably most important. But if we introduce own mechanism, we will play with process latch too (although we can use LWlocks)
On Thu, Sep 17, 2015 at 11:16 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote: > Is there some risk if we take too much large DSM segment? And what is max > size of DSM segment? When we use shm_mq, we don't need to solve limits. I can't really think you are going to have a problem. How much data do you really intend to send back? Surely it's going to be <100kB. If you think it's not a problem to have a running query stop and send a gigabyte of data someplace anytime somebody asks, well, I don't think I agree. >> Also, if there are any bugs in the way the shm_mq is being used, >> they're likely to be quite rare and hard to find, because the vast >> majority of messages will probably be short enough to be sent in a >> single chunk, so whatever bugs may exist when the processes play >> ping-ping are unlikely to occur in practice except in unusual cases >> where the message being returned is very long. > > This is true for any functionality based on shm_mq - parallel seq scan, Parallel sequential scan is likely to put a lot more data through a shm_mq than you would for this. >> Second, using a shm_mq manipulates the state of the process latch. I >> don't think you can make the assumption that it's safe to reset the >> process latch at any and every place where we check for interrupts. >> For example, suppose the process is already using a shm_mq and the >> CHECK_FOR_INTERRUPTS() call inside that code then discovers that >> somebody has activated this mechanism and you now go try to send and >> receive from a new shm_mq. But even if that and every other >> CHECK_FOR_INTERRUPTS() in the code can tolerate a process latch reset >> today, it's a new coding rule that could easily trip people up in the >> future. > > It is valid, and probably most important. But if we introduce own mechanism, > we will play with process latch too (although we can use LWlocks) With the design I proposed, there is zero need to touch the process latch, which is good, because I'm pretty sure that is going to be a problem. I don't think there is any need to use LWLocks here either. When you get a request for data, you can just publish a DSM segment with the data and that's it. Why do you need anything more? You could set the requestor's latch if it's convenient; that wouldn't be a problem. But the process supplying the data can't end up in a different state than it was before supplying that data, or stuff WILL break. >> Using a shm_mq is appropriate when the amount of data that needs to be >> transmitted might be very large - too large to just allocate a buffer >> for the whole thing - or when the amount of data can't be predicted >> before memory is allocated. But there is obviously no rule that a >> shm_mq should be used any time we have "data exchange between >> processes"; we have lots of shared-memory based IPC already and have >> for many years, and shm_mq is newer than the vast majority of that >> code. > > I am little bit disappointed - I hoped so shm_mq can be used as generic > interprocess mechanism - that will ensure all corner cases for work with > shared memory. I understand to shm_mq is new, and nobody used it, but this > risk is better than invent wheels again and again. shm_mq is useful, but if you insist on using a complicated tool when a simple one is plenty sufficient, you may not get the results you're hoping for. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
2015-09-17 16:06 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>:On Thu, Sep 17, 2015 at 12:06 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:That won't work really well with something like I use to do when testing this patch, namely:postgres=# select pid, array(select pg_cmdstatus(pid, 1, 10)) from pg_stat_activity where pid<>pg_backend_pid() \watch 1while also running pgbench with -C option (to create new connection for every transaction). When a targeted backend exits before it can handle the signal, the receiving process keeps waiting forever.no - every timeout you have to check, if targeted backend is living still, if not you will do cancel. It is 100% safe.But then you need to make this internal timeout rather short, not 1s as originally suggested.can be - 1 sec is max, maybe 100ms is optimum.The statement_timeout in this case will stop the whole select, not just individual function call. Unless you wrap it to set statement_timeout and catch QUERY_CANCELED in plpgsql, but then you won't be able to ^C the whole select. The ability to set a (short) timeout for the function itself proves to be a really useful feature to me.you cannot to handle QUERY_CANCELED in plpgsql.Well, you can but its not that useful of course:hmm, some is wrong - I remember from some older plpgsql, so CANCEL message is not catchable. Maybe I have bad memory. I have to recheck it.
On Thu, Sep 17, 2015 at 11:16 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> Is there some risk if we take too much large DSM segment? And what is max
> size of DSM segment? When we use shm_mq, we don't need to solve limits.
I can't really think you are going to have a problem. How much data
do you really intend to send back? Surely it's going to be <100kB.
If you think it's not a problem to have a running query stop and send
a gigabyte of data someplace anytime somebody asks, well, I don't
think I agree.
>> Also, if there are any bugs in the way the shm_mq is being used,
>> they're likely to be quite rare and hard to find, because the vast
>> majority of messages will probably be short enough to be sent in a
>> single chunk, so whatever bugs may exist when the processes play
>> ping-ping are unlikely to occur in practice except in unusual cases
>> where the message being returned is very long.
>
> This is true for any functionality based on shm_mq - parallel seq scan,
Parallel sequential scan is likely to put a lot more data through a
shm_mq than you would for this.
>> Second, using a shm_mq manipulates the state of the process latch. I
>> don't think you can make the assumption that it's safe to reset the
>> process latch at any and every place where we check for interrupts.
>> For example, suppose the process is already using a shm_mq and the
>> CHECK_FOR_INTERRUPTS() call inside that code then discovers that
>> somebody has activated this mechanism and you now go try to send and
>> receive from a new shm_mq. But even if that and every other
>> CHECK_FOR_INTERRUPTS() in the code can tolerate a process latch reset
>> today, it's a new coding rule that could easily trip people up in the
>> future.
>
> It is valid, and probably most important. But if we introduce own mechanism,
> we will play with process latch too (although we can use LWlocks)
With the design I proposed, there is zero need to touch the process
latch, which is good, because I'm pretty sure that is going to be a
problem. I don't think there is any need to use LWLocks here either.
When you get a request for data, you can just publish a DSM segment
with the data and that's it. Why do you need anything more? You
could set the requestor's latch if it's convenient; that wouldn't be a
problem. But the process supplying the data can't end up in a
different state than it was before supplying that data, or stuff WILL
break.
>> Using a shm_mq is appropriate when the amount of data that needs to be
>> transmitted might be very large - too large to just allocate a buffer
>> for the whole thing - or when the amount of data can't be predicted
>> before memory is allocated. But there is obviously no rule that a
>> shm_mq should be used any time we have "data exchange between
>> processes"; we have lots of shared-memory based IPC already and have
>> for many years, and shm_mq is newer than the vast majority of that
>> code.
>
> I am little bit disappointed - I hoped so shm_mq can be used as generic
> interprocess mechanism - that will ensure all corner cases for work with
> shared memory. I understand to shm_mq is new, and nobody used it, but this
> risk is better than invent wheels again and again.
shm_mq is useful, but if you insist on using a complicated tool when a
simple one is plenty sufficient, you may not get the results you're
hoping for.
On Thu, Sep 17, 2015 at 11:16 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>> Second, using a shm_mq manipulates the state of the process latch. I
>> don't think you can make the assumption that it's safe to reset the
>> process latch at any and every place where we check for interrupts.
>> For example, suppose the process is already using a shm_mq and the
>> CHECK_FOR_INTERRUPTS() call inside that code then discovers that
>> somebody has activated this mechanism and you now go try to send and
>> receive from a new shm_mq. But even if that and every other
>> CHECK_FOR_INTERRUPTS() in the code can tolerate a process latch reset
>> today, it's a new coding rule that could easily trip people up in the
>> future.
>
> It is valid, and probably most important. But if we introduce own mechanism,
> we will play with process latch too (although we can use LWlocks)
With the design I proposed, there is zero need to touch the process
latch, which is good, because I'm pretty sure that is going to be a
problem. I don't think there is any need to use LWLocks here either.
When you get a request for data, you can just publish a DSM segment
with the data and that's it. Why do you need anything more? You
could set the requestor's latch if it's convenient; that wouldn't be a
problem. But the process supplying the data can't end up in a
different state than it was before supplying that data, or stuff WILL
break.
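To make that shape concrete - a sketch of the responding side only, with CmdStatusResult, CMD_STATUS_OK, slot and requestor invented for illustration; the requester would attach via the published handle, copy the string out, and only then could the creator let go of its own mapping:

typedef struct
{
    int     result_code;
    char    data[FLEXIBLE_ARRAY_MEMBER];        /* the whole plan text */
} CmdStatusResult;

/* inside the signalled backend, once plan_text has been built */
Size             len = strlen(plan_text) + 1;
dsm_segment     *seg = dsm_create(offsetof(CmdStatusResult, data) + len, 0);
CmdStatusResult *res = dsm_segment_address(seg);

res->result_code = CMD_STATUS_OK;
memcpy(res->data, plan_text, len);

/* publish the handle where the requester can find it, and wake it up */
slot->handle = dsm_segment_handle(seg);
SetLatch(&requestor->procLatch);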
On Thu, Sep 17, 2015 at 10:13 PM, Robert Haas <robertmhaas@gmail.com> wrote:On Thu, Sep 17, 2015 at 11:16 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>> Second, using a shm_mq manipulates the state of the process latch. I
>> don't think you can make the assumption that it's safe to reset the
>> process latch at any and every place where we check for interrupts.
>> For example, suppose the process is already using a shm_mq and the
>> CHECK_FOR_INTERRUPTS() call inside that code then discovers that
>> somebody has activated this mechanism and you now go try to send and
>> receive from a new shm_mq. But even if that and every other
>> CHECK_FOR_INTERRUPTS() in the code can tolerate a process latch reset
>> today, it's a new coding rule that could easily trip people up in the
>> future.
>
> It is valid, and probably most important. But if we introduce own mechanism,
> we will play with process latch too (although we can use LWlocks)
With the design I proposed, there is zero need to touch the process
latch, which is good, because I'm pretty sure that is going to be a
problem. I don't think there is any need to use LWLocks here either.
When you get a request for data, you can just publish a DSM segment
with the data and that's it. Why do you need anything more? You
could set the requestor's latch if it's convenient; that wouldn't be a
problem. But the process supplying the data can't end up in a
different state than it was before supplying that data, or stuff WILL
break.

There is still the whole problem of where exactly the backend being queried for the status should publish that DSM segment and when to free it?

If it's a location shared between all backends, there should be locking around it. Probably this is not a big problem, if you don't expect all the backends start querying each other rapidly. That is how it was implemented in the first versions of this patch actually.

If we take the per-backend slot approach the locking seems unnecessary and there are principally two options:

1) The backend puts the DSM handle in its own slot and notifies the requester to read it.
2) The backend puts the DSM handle in the slot of the requester (and notifies it).

If we go with the first option, the backend that has created the DSM will not know when it's OK to free it, so this has to be responsibility of the requester. If the latter exits before reading and freeing the DSM, we have a leak. Even bigger is the problem that the sender backend can no longer send responses to a number of concurrent requestors: if its slot is occupied by a DSM handle, it can not send a reply to another backend until the slot is freed.

With the second option we have all the same problems with not knowing when to free the DSM and potentially leaking it, but we can handle concurrent requests.
The current approach, where the requester creates and frees the DSM, doesn't suffer from these problems, so if we pre-allocate a segment just big enough we can avoid the use of shm_mq. That will take another GUC for the segment size. Certainly no one expects a query plan to weigh a bloody megabyte, but that is apparently what happens in Pavel's case.
--Alex
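To make that "requester creates and frees" flow a bit more concrete, here is a minimal sketch of the requester side using the 9.5-era DSM and latch APIs. The slot/signalling call ask_backend_for_plan() and the 5-second timeout are invented for illustration; error handling and the timed-out case are omitted.

#include "postgres.h"
#include "miscadmin.h"
#include "storage/dsm.h"
#include "storage/latch.h"
#include "storage/proc.h"

/* hypothetical: publish the handle in the target's slot and signal it */
extern void ask_backend_for_plan(PGPROC *target, dsm_handle handle);

static char *
request_plan(PGPROC *target, Size bufsize)
{
	dsm_segment *seg = dsm_create(bufsize, 0);
	char	   *result;

	ask_backend_for_plan(target, dsm_segment_handle(seg));

	/* wait for the queried backend to fill the segment and wake us */
	WaitLatch(MyLatch, WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH, 5000L);
	ResetLatch(MyLatch);

	/* the requester created the segment, so the requester frees it */
	result = pstrdup((char *) dsm_segment_address(seg));
	dsm_detach(seg);
	return result;
}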
2015-09-18 10:59 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>:

If we take the per-backend slot approach the locking seems unnecessary and there are principally two options:
1) The backend puts the DSM handle in its own slot and notifies the requester to read it.
2) The backend puts the DSM handle in the slot of the requester (and notifies it).
If we go with the first option, the backend that has created the DSM will not know when it's OK to free it, so this has to be responsibility of the requester. If the latter exits before reading and freeing the DSM, we have a leak. Even bigger is the problem that the sender backend can no longer send responses to a number of concurrent requestors: if its slot is occupied by a DSM handle, it can not send a reply to another backend until the slot is freed.
With the second option we have all the same problems with not knowing when to free the DSM and potentially leaking it, but we can handle concurrent requests.

That should not be true: the data sender creates the DSM and fills it, then sets the caller's slot and sends a signal to the caller. The caller can free the DSM at any time, because the data sender will never touch it again.
On Fri, Sep 18, 2015 at 11:25 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:

2015-09-18 10:59 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>:
If we take the per-backend slot approach the locking seems unnecessary and there are principally two options:
1) The backend puts the DSM handle in its own slot and notifies the requester to read it.
2) The backend puts the DSM handle in the slot of the requester (and notifies it).
If we go with the first option, the backend that has created the DSM will not know when it's OK to free it, so this has to be responsibility of the requester. If the latter exits before reading and freeing the DSM, we have a leak. Even bigger is the problem that the sender backend can no longer send responses to a number of concurrent requestors: if its slot is occupied by a DSM handle, it can not send a reply to another backend until the slot is freed.
With the second option we have all the same problems with not knowing when to free the DSM and potentially leaking it, but we can handle concurrent requests.

That should not be true: the data sender creates the DSM and fills it, then sets the caller's slot and sends a signal to the caller. The caller can free the DSM at any time, because the data sender will never touch it again.

But the requester can timeout on waiting for reply and exit before it sees the reply DSM. Actually, I now don't even think a backend can free the DSM it has not created. First it will need to attach it, effectively increasing the refcount, and upon detach it will only decrease the refcount, but not actually release the segment...
So this has to be the responsibility of the reply sending backend in the end: to create and release the DSM *at some point*.--Alex
2015-09-18 12:05 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>:

On Fri, Sep 18, 2015 at 11:25 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
That should not be true: the data sender creates the DSM and fills it, then sets the caller's slot and sends a signal to the caller. The caller can free the DSM at any time, because the data sender will never touch it again.

But the requester can timeout on waiting for reply and exit before it sees the reply DSM. Actually, I now don't even think a backend can free the DSM it has not created. First it will need to attach it, effectively increasing the refcount, and upon detach it will only decrease the refcount, but not actually release the segment...

I am afraid there is no simple and nice solution: if the data sender has to wait until the data is received, we end up with the same complexity as with shm_mq.

Wouldn't it be better to introduce a new background worker responsible for cleaning up orphaned DSM segments?
On Fri, Sep 18, 2015 at 12:59 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:

2015-09-18 12:05 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>:
On Fri, Sep 18, 2015 at 11:25 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
That should not be true: the data sender creates the DSM and fills it, then sets the caller's slot and sends a signal to the caller. The caller can free the DSM at any time, because the data sender will never touch it again.
But the requester can timeout on waiting for reply and exit before it sees the reply DSM. Actually, I now don't even think a backend can free the DSM it has not created. First it will need to attach it, effectively increasing the refcount, and upon detach it will only decrease the refcount, but not actually release the segment...
I am afraid there is no simple and nice solution: if the data sender has to wait until the data is received, we end up with the same complexity as with shm_mq.
Wouldn't it be better to introduce a new background worker responsible for cleaning up orphaned DSM segments?

I'm not thrilled by this idea.

We don't even need a worker to do that, as leaked segments are reported by the backend itself upon transaction commit (see ResourceOwnerReleaseInternal), e.g.:

WARNING: dynamic shared memory leak: segment 808539725 still referenced

Still, I believe relying on some magic cleanup mechanism and not caring about managing the shared memory properly is not the way to go.
--Alex
On Thu, Sep 17, 2015 at 12:47 PM, Shulgin, Oleksandr <oleksandr.shulgin@zalando.de> wrote: > But if we make the sender backend create the DSM with the response payload, > it suddenly becomes really unclear at which point and who should make the > final detach of that DSM. We're getting back to the problem of > synchronization between these processes all over again. Yes, some thought is needed here. There's not really a problem if somebody asks for information just once: you could just have the publishing process keep it attached until process exit, and nothing bad would happen. The real problem comes when multiple people ask for information in quick succession: if you detach the old segment too quickly after publishing it, the guy who requested that data might not have attached it yet, and then when he gets around to looking at it, it won't be there. This seems like a pretty solvable problem, though. For example, when somebody asks for information, they store in the main shared segment a pointer to their PGPROC and then they signal the target process, which responds by publishing a dsm_segment_id. When the requesting process has accessed it, or when it errors out or exits, it clears the PGPROC and the dsm_segment_id. The publisher avoids unmapping it until that happens. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
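One possible shape for the bookkeeping Robert describes, assuming a single outstanding request per queried backend. The struct and its field names are invented for illustration; PGPROC, dsm_handle and slock_t are real backend types.

#include "postgres.h"
#include "storage/dsm.h"
#include "storage/proc.h"
#include "storage/spin.h"

/*
 * One slot per backend in the main shared memory segment.  The requester
 * fills in "requestor" and signals the target; the target publishes
 * "response" and keeps its segment mapped; the requester clears both
 * fields once it has attached (or when it errors out or exits), which
 * tells the publisher it may unmap the segment.
 */
typedef struct PlanRequestSlot
{
	slock_t		mutex;			/* protects the fields below */
	PGPROC	   *requestor;		/* who is asking, NULL if nobody */
	dsm_handle	response;		/* published plan, 0 if not ready yet */
} PlanRequestSlot;

static PlanRequestSlot *request_slots;	/* MaxBackends entries */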
On Fri, Sep 18, 2015 at 6:59 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote: > I am afraid so it has not simple and nice solution - when data sender will > wait for to moment when data are received, then we have same complexity like > we use shm_mq. > > Isn't better to introduce new background worker with responsibility to clean > orphaned DSM? That won't work, or at least not easily. On Windows, the DSM is cleaned up by the operating system as soon as nobody has it mapped. Frankly, I think you guys are making this out to be way more complicated than it really is. Basically, I think the process being queried should publish a DSM via a slot it owns. The recipient is responsible for reading it and then notifying the sender. If a second process requests data before the first process reads its data, the second process can either (1) become an additional reader of the already-published data or (2) wait for the first process to finish, and then send its own inquiry. There are some problems to solve here, but they hardly seem impossible. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Frankly, I think you guys are making this out to be way more
complicated than it really is. Basically, I think the process being
queried should publish a DSM via a slot it owns. The recipient is
responsible for reading it and then notifying the sender. If a second
process requests data before the first process reads its data, the
second process can either (1) become an additional reader of the
already-published data or (2) wait for the first process to finish,
and then send its own inquiry.
There are some problems to solve here, but they hardly seem impossible.
The greatest part for me, is that this approach doesn't require handling of signals and is isolated in an external module, so it can be readily used with the current server versions, no need to wait for >= 9.6.
Attachment
Some problems:

There is a potential problem with the limited total number of DSM segments: it is reserved in a way to only allow 2 per backend (on average) and 64 additional per server, so if you run with the new option enabled at all times, you're down to only 1 additional DSM per backend (again, on average). Not sure how critical this can be, but no one is forced to run with this option enabled for all backends.

If you don't want to run it enabled at all times, then enabling the GUC per-backend can be problematic. It's still possible to update the conf file and send SIGHUP to a single backend, but it's harder to accomplish over psql, for example. I think here we might still have some luck with the signals: use another array of per-backend slots with flags, set the target backend's flag and send it SIGUSR1. The backend wakes on the signal and examines its slot, then toggles the GUC if needed. Sounds pretty safe, eh?

No documentation changes yet, waiting for your comments. :-)
Happy hacking!--Alex
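A rough sketch of that flag-plus-SIGUSR1 idea, deliberately doing nothing in the handler beyond setting a flag. How such a handler would coexist with the core use of SIGUSR1 is glossed over here, and the slot array, handler registration and function names are all hypothetical.

#include "postgres.h"
#include "miscadmin.h"
#include "utils/guc.h"

static volatile sig_atomic_t toggle_pending = false;
static bool *toggle_flags;		/* MaxBackends entries in shared memory */

/* async-signal-safe: only set a flag, examine it later */
static void
publish_plans_sigusr1_handler(SIGNAL_ARGS)
{
	toggle_pending = true;
}

/* called from a safe place, e.g. at the top of an executor hook */
static void
maybe_toggle_publishing(void)
{
	if (toggle_pending && toggle_flags[MyBackendId - 1])
	{
		toggle_pending = false;
		toggle_flags[MyBackendId - 1] = false;
		SetConfigOption("auto_explain.publish_plans", "on",
						PGC_SUSET, PGC_S_SESSION);
	}
}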
Preparing the content before execution is an interesting idea that could be used more widely. Most queries and plans are not too big, so when the content is small - less than 1MB, say - one DSM segment could be shared by all backends.
When the content is bigger than that limit, a DSM segment would be allocated specifically for it. The pointer to the DSM and the offset can be stored in the requested process's slot. Reading and writing the requested slot should be protected by a spinlock, but that would block only the two processes involved, and only for the short time needed to copy the memory.
Another possibility is to publish the size of the content in the requested process's slot, so the requester can preallocate a large enough piece of shared memory. But this doesn't solve the issues related to the requester waiting for the content. So the first variant is pretty simple and should be preferred. The disadvantage is clear - it may impose a significant slowdown on fast queries.
Alex
On Sun, Sep 27, 2015 at 8:05 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:the preparing of content before execution is interesting idea, that can be used more. The almost queries and plans are not too big, so when the size of content is not too big - less than 1MB, then can be used one DSM for all backends.When size of content is bigger than limit, then DSM will be allocated specially for this content. The pointer to DSM and offset can be stored in requested process slot. The reading and writing to requested slot should be protected by spinlock, but it should block only two related processes for short time (copy memory).Sorry, I don't think this will fly.The whole idea is that a backend publishes the plan of a query just before running it and it doesn't care which other backend(s) might be reading it, how many times and in which order. The only required locking (implicit) is contained in the code for dsm_attach/detach().
Other possibility is showing the size of content in requested process slot. Then the requester can preallocate enough size of shared memory. But this doesn't solve a issues related to waiting requester for content. So first variant is pretty simple, and should be preferred. The disadvantage is clear - it can enforce maybe significant slowdown of fast queries.Both of these approaches have just too many synchronization problems, IMO.--
Alex
2015-09-28 12:01 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>:

On Sun, Sep 27, 2015 at 8:05 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
the preparing of content before execution is interesting idea, that can be used more. The almost queries and plans are not too big, so when the size of content is not too big - less than 1MB, then can be used one DSM for all backends.
When size of content is bigger than limit, then DSM will be allocated specially for this content. The pointer to DSM and offset can be stored in requested process slot. The reading and writing to requested slot should be protected by spinlock, but it should block only two related processes for short time (copy memory).

Sorry, I don't think this will fly.

The whole idea is that a backend publishes the plan of a query just before running it and it doesn't care which other backend(s) might be reading it, how many times and in which order. The only required locking (implicit) is contained in the code for dsm_attach/detach().

I didn't propose a very different solution. There is only one difference - sharing one DSM segment for the smaller data. It is similar to using plain shared memory.
Alex
Some implementation details:

For every backend that might be running (MaxBackends) we reserve a dsm_handle slot in the addins shared memory. When the new option is turned on, the ExecutorRun hook will produce a plan in whatever format is specified by the auto_explain.log_format, allocate a DSM segment, copy the plan into the segment and finally store the DSM handle into its own slot. No locking is required around this because every backend writes to its slot exclusively; no other backend can be writing into it concurrently. In the ExecutorFinish hook we invalidate the backend's slot by setting the handle to 0 and deallocate the DSM segment.
Attachment
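An abridged sketch of the flow described above, loosely following the 9.5-era executor-hook, EXPLAIN and DSM APIs. The shmem-startup boilerplate that sets up plan_slots, error handling and auto_explain's nesting-level bookkeeping are omitted, and the variable names are illustrative rather than the patch's actual ones.

#include "postgres.h"
#include "commands/explain.h"
#include "executor/executor.h"
#include "miscadmin.h"
#include "storage/dsm.h"

static dsm_handle *plan_slots;	/* MaxBackends entries, set up at shmem startup */
static dsm_segment *my_plan_seg = NULL;
static ExecutorRun_hook_type prev_ExecutorRun = NULL;
static ExecutorFinish_hook_type prev_ExecutorFinish = NULL;

static void
publish_ExecutorRun(QueryDesc *queryDesc, ScanDirection direction, long count)
{
	ExplainState *es = NewExplainState();

	/* render the plan much like auto_explain would */
	es->format = EXPLAIN_FORMAT_TEXT;
	ExplainBeginOutput(es);
	ExplainPrintPlan(es, queryDesc);
	ExplainEndOutput(es);

	/* publish it: one segment per backend, handle goes into our own slot */
	my_plan_seg = dsm_create(es->str->len + 1, 0);
	memcpy(dsm_segment_address(my_plan_seg), es->str->data, es->str->len + 1);
	plan_slots[MyBackendId - 1] = dsm_segment_handle(my_plan_seg);

	if (prev_ExecutorRun)
		prev_ExecutorRun(queryDesc, direction, count);
	else
		standard_ExecutorRun(queryDesc, direction, count);
}

static void
publish_ExecutorFinish(QueryDesc *queryDesc)
{
	/* invalidate our slot (0 = no plan published) and drop the segment */
	plan_slots[MyBackendId - 1] = 0;
	if (my_plan_seg != NULL)
	{
		dsm_detach(my_plan_seg);
		my_plan_seg = NULL;
	}

	if (prev_ExecutorFinish)
		prev_ExecutorFinish(queryDesc);
	else
		standard_ExecutorFinish(queryDesc);
}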
On Mon, Sep 28, 2015 at 12:05 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:2015-09-28 12:01 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>:On Sun, Sep 27, 2015 at 8:05 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:the preparing of content before execution is interesting idea, that can be used more. The almost queries and plans are not too big, so when the size of content is not too big - less than 1MB, then can be used one DSM for all backends.When size of content is bigger than limit, then DSM will be allocated specially for this content. The pointer to DSM and offset can be stored in requested process slot. The reading and writing to requested slot should be protected by spinlock, but it should block only two related processes for short time (copy memory).Sorry, I don't think this will fly.The whole idea is that a backend publishes the plan of a query just before running it and it doesn't care which other backend(s) might be reading it, how many times and in which order. The only required locking (implicit) is contained in the code for dsm_attach/detach().I didn't propose too different solution. There is only one difference - sharing DSM for smaller data. It is similar to using usual shared memory.Does this mean implementing some sort of allocator on top of the shared memory segment? If so, how are you going to prevent fragmentation?
--
Alex
2015-09-28 12:37 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>:

I didn't propose too different solution. There is only one difference - sharing DSM for smaller data. It is similar to using usual shared memory.

Does this mean implementing some sort of allocator on top of the shared memory segment? If so, how are you going to prevent fragmentation?

Yes, a simple memory allocator is necessary in this case. But it can be really simple: only fixed-size blocks - 10KB, 100KB and 1MB - are allocated, each from a separate buffer, so fragmentation is not possible.
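A minimal sketch of what such a fixed-size-block allocator could look like: one freelist per size class, carved out of the shared area at startup. All names here are made up, and the locking around the freelists is left out.

#include "postgres.h"

typedef struct FixedBlock
{
	struct FixedBlock *next;	/* freelist link while the block is unused */
	Size		used;			/* bytes of payload actually written */
	char		data[FLEXIBLE_ARRAY_MEMBER];
} FixedBlock;

/* one freelist per size class, populated once at shmem startup */
static FixedBlock *freelist_10k;
static FixedBlock *freelist_100k;
static FixedBlock *freelist_1m;

static FixedBlock *
alloc_block(Size len)
{
	FixedBlock **list;
	FixedBlock *blk;

	if (len <= 10 * 1024)
		list = &freelist_10k;
	else if (len <= 100 * 1024)
		list = &freelist_100k;
	else if (len <= 1024 * 1024)
		list = &freelist_1m;
	else
		return NULL;			/* too big: fall back to a dedicated DSM */

	if (*list == NULL)
		return NULL;			/* size class exhausted */

	blk = *list;
	*list = blk->next;
	return blk;
}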
On Mon, Sep 28, 2015 at 7:09 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:2015-09-28 12:37 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>:I didn't propose too different solution. There is only one difference - sharing DSM for smaller data. It is similar to using usual shared memory.Does this mean implementing some sort of allocator on top of the shared memory segment? If so, how are you going to prevent fragmentation?yes, simple memory allocator is necessary in this case. But it should be really simple - you can allocate only fixed size blocks - 10KB, 100KB and 1MB from separate buffers. So the fragmentation is not possible.Maybe we're talking about completely different ideas, but I don't see how fixing the block helps to fight fragmentation, in particular. Can you sketch a simple example? E.g. 400 backends share the common segment and all of them want to publish a plan of ~1kb, for example. Now what?
--Alex
On 9/18/15 5:05 AM, Shulgin, Oleksandr wrote: > > It should not be true - the data sender create DSM and fills it. > Then set caller slot and send signal to caller. Caller can free DSM > any time, because data sender send newer touch it. > > > But the requester can timeout on waiting for reply and exit before it > sees the reply DSM. Actually, I now don't even think a backend can free > the DSM it has not created. First it will need to attach it, > effectively increasing the refcount, and upon detach it will only > decrease the refcount, but not actually release the segment... > > So this has to be the responsibility of the reply sending backend in the > end: to create and release the DSM *at some point*. What's wrong with just releasing it at the end of the statement? When the statement is done there's no point to reading it asynchronously anymore. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com
On 9/18/15 5:05 AM, Shulgin, Oleksandr wrote:
So this has to be the responsibility of the reply sending backend in the
end: to create and release the DSM *at some point*.
What's wrong with just releasing it at the end of the statement? When the statement is done there's no point to reading it asynchronously anymore.
On 2015-09-25 19:13:13 +0200, Shulgin, Oleksandr wrote: > the auto_explain contrib module. I now propose the most simple thing > possible in my opinion: a new GUC option for auto_explain. It will make > every backend, in which the module is loaded via *_preload_libraries > mechanism or using CREATE EXTENSION/LOAD commands, to actively publish the > plans of queries in dynamic shared memory as long as > auto_explain.publish_plans is turned on. Wait. You're proposing that every query does this unconditionally? For every single query? That sounds prohibitively expensive to me. Greetings, Andres Freund
On 2015-09-25 19:13:13 +0200, Shulgin, Oleksandr wrote:
> the auto_explain contrib module. I now propose the most simple thing
> possible in my opinion: a new GUC option for auto_explain. It will make
> every backend, in which the module is loaded via *_preload_libraries
> mechanism or using CREATE EXTENSION/LOAD commands, to actively publish the
> plans of queries in dynamic shared memory as long as
> auto_explain.publish_plans is turned on.
Wait. You're proposing that every query does this unconditionally? For
every single query? That sounds prohibitively expensive to me.
I now believe that the use of ProcessInterrupts() in the recently proposed design, as well as the manipulation of the proc latch due to the use of a shared memory queue, are major blockers.

To simplify things, I've taken a step back and looked again at the auto_explain contrib module. I now propose the simplest thing possible, in my opinion: a new GUC option for auto_explain. It will make every backend in which the module is loaded - via the *_preload_libraries mechanism or the CREATE EXTENSION/LOAD commands - actively publish the plans of its queries in dynamic shared memory, as long as auto_explain.publish_plans is turned on.
The greatest part for me, is that this approach doesn't require handling of signals and is isolated in an external module, so it can be readily used with the current server versions, no need to wait for >= 9.6.
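For reference, the new option itself boils down to a single custom boolean GUC; here is a sketch of how it might be registered from auto_explain's _PG_init() using the standard DefineCustomBoolVariable() API. The exact context, flags and description are guesses.

#include "postgres.h"
#include "utils/guc.h"

static bool publish_plans = false;

void
_PG_init(void)
{
	DefineCustomBoolVariable("auto_explain.publish_plans",
							 "Publish plans of currently running queries in dynamic shared memory.",
							 NULL,
							 &publish_plans,
							 false,
							 PGC_SUSET,
							 0,
							 NULL,
							 NULL,
							 NULL);
}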
On 25 September 2015 at 12:13, Shulgin, Oleksandr <oleksandr.shulgin@zalando.de> wrote:I now believe that use of ProcessInterrupts() in the recently proposed design as well as manipulation of proc latch due to use of shared memory queue are major blockers.In order to simplify things up, I've taken a step back and looked again at the auto_explain contrib module. I now propose the most simple thing possible in my opinion: a new GUC option for auto_explain. It will make every backend, in which the module is loaded via *_preload_libraries mechanism or using CREATE EXTENSION/LOAD commands, to actively publish the plans of queries in dynamic shared memory as long as auto_explain.publish_plans is turned on.
The greatest part for me, is that this approach doesn't require handling of signals and is isolated in an external module, so it can be readily used with the current server versions, no need to wait for >= 9.6.

This is a major change of direction, so the thread title no longer applies at all.

The requirement is to be able to see the output of EXPLAIN ANALYZE for a running process. Ideally, we would dump the diagnostic output for any process that runs longer than a specific timeout, so we can decide whether to kill it, or just save that for later diagnostics.

I'm interested in helping the original goal happen. Dumping an EXPLAIN, without ANALYZE info, is not at all interesting since it has no value for immediate action or post-facto diagnosis, sorry to say - making it happen for every backend just makes it even less useful.
Hitting a process with a signal and hoping it will produce a meaningful response in all circumstances without disrupting its current task was way too naive.
On 29 September 2015 at 12:52, Shulgin, Oleksandr <oleksandr.shulgin@zalando.de> wrote:Hitting a process with a signal and hoping it will produce a meaningful response in all circumstances without disrupting its current task was way too naive.Hmm, I would have to disagree, sorry. For me the problem was dynamically allocating everything at the time the signal is received and getting into problems when that caused errors.
* INIT - Allocate N areas of memory for use by queries, which can be expanded/contracted as needed. Keep a freelist of structures.
* OBSERVER - When requested, gain exclusive access to a diagnostic area, then allocate the designated process to that area, then send a signal
* QUERY - When signal received dump an EXPLAIN ANALYZE to the allocated diagnostic area, (set flag to show complete, set latch on observer)
* OBSERVER - process data in diagnostic area and then release area for use by next observation

If the EXPLAIN ANALYZE doesn't fit into the diagnostic chunk, LOG it as a problem and copy data only up to the size defined. Any other ERRORs that are caused by this process cause it to fail normally.
That allows the observer to be another backend, or it allows the query process to perform self-observation based upon a timeout (e.g. >1 hour) or a row limit (e.g. when an optimizer estimate is seen to be badly wrong).
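To make the shape of those diagnostic areas a bit more concrete, here is a rough sketch of the INIT/OBSERVER/QUERY flow above. The struct, the DIAG_AREA_SIZE constant and the protocol comments are invented for illustration; only slock_t, PGPROC and the latch machinery are real backend primitives.

#include "postgres.h"
#include "storage/proc.h"
#include "storage/spin.h"

#define DIAG_AREA_SIZE	(16 * 1024)		/* hypothetical fixed chunk size */

typedef struct DiagnosticArea
{
	slock_t		mutex;			/* guards assignment of the area */
	int			target_pid;		/* backend being observed, 0 = area is free */
	PGPROC	   *observer;		/* whose latch to set when the dump is done */
	bool		complete;		/* set by the query process when data is ready */
	Size		used;			/* bytes written, possibly truncated */
	char		data[DIAG_AREA_SIZE];
} DiagnosticArea;

/*
 * OBSERVER: claim a free area under the spinlock, record target_pid and
 *           its own PGPROC, then signal the target backend.
 * QUERY:    on seeing the request, dump EXPLAIN ANALYZE into data (or as
 *           much of it as fits), set complete and the observer's latch.
 * OBSERVER: copy the data out, then mark the area free for the next use.
 */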
On Tue, Sep 29, 2015 at 8:34 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 29 September 2015 at 12:52, Shulgin, Oleksandr <oleksandr.shulgin@zalando.de> wrote:
Hitting a process with a signal and hoping it will produce a meaningful response in all circumstances without disrupting its current task was way too naive.

Hmm, I would have to disagree, sorry. For me the problem was dynamically allocating everything at the time the signal is received and getting into problems when that caused errors.

What I mean is that we need to move the actual EXPLAIN run out of ProcessInterrupts(). It can still be fine to trigger the communication with a signal.
* INIT - Allocate N areas of memory for use by queries, which can be expanded/contracted as needed. Keep a freelist of structures.
* OBSERVER - When requested, gain exclusive access to a diagnostic area, then allocate the designated process to that area, then send a signal
* QUERY - When signal received dump an EXPLAIN ANALYZE to the allocated diagnostic area, (set flag to show complete, set latch on observer)
* OBSERVER - process data in diagnostic area and then release area for use by next observation
If the EXPLAIN ANALYZE doesn't fit into the diagnostic chunk, LOG it as a problem and copy data only up to the size defined. Any other ERRORs that are caused by this process cause it to fail normally.

Do you envision problems if we do this with a newly allocated DSM every time instead of a pre-allocated area? This would reverse the workflow, because only the QUERY knows the required segment size:

OBSERVER - sends a signal and waits for its proc latch to be set
QUERY - when the signal is received, allocates a DSM just big enough to fit the EXPLAIN plan, then locates the OBSERVER(s) and sets its latch (or their latches)

The EXPLAIN plan should already be produced somewhere in the executor, to avoid calling into explain.c from ProcessInterrupts().

That allows the observer to be another backend, or it allows the query process to perform self-observation based upon a timeout (e.g. >1 hour) or a row limit (e.g. when an optimizer estimate is seen to be badly wrong).

Do you think there is one single best place in the executor code where such a check could be added? I have very little idea about that.
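A small sketch of the QUERY side of that variant: the segment is sized to the already-rendered plan text, the handle is published, and the observer's latch is set. publish_handle_for() stands in for whatever slot mechanism ends up being used and is not a real API; when the publishing backend may detach the segment is exactly the open question discussed in the surrounding mails.

#include "postgres.h"
#include "storage/dsm.h"
#include "storage/latch.h"
#include "storage/proc.h"

/* hypothetical: place the handle where the observer will look for it */
extern void publish_handle_for(PGPROC *observer, dsm_handle handle);

static void
answer_plan_request(const char *plan_text, PGPROC *observer)
{
	Size		len = strlen(plan_text) + 1;
	dsm_segment *seg = dsm_create(len, 0);

	memcpy(dsm_segment_address(seg), plan_text, len);
	publish_handle_for(observer, dsm_segment_handle(seg));
	SetLatch(&observer->procLatch);
}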
On Tue, Sep 29, 2015 at 1:52 PM, Shulgin, Oleksandr <oleksandr.shulgin@zalando.de> wrote: > This is not a change of the direction, but rather of the approach. Hitting > a process with a signal and hoping it will produce a meaningful response in > all circumstances without disrupting its current task was way too naive. I think that's exactly right. It's not safe to do much anything from a signal handler, and while ProcessInterrupts() is a very substantially safer, the set of things that can be safely done there is still awfully restricted. You have to cope with the fact that the function you just interrupted may be doing anything at all, and if you change anything before returning, you may knock over the apple cart. Just as bad, the state you inherit may not be very sane: we call ProcessInterrupts() from LOTS of places and there is absolutely no guarantee that every one of those places has the data structures that you need to run EXPLAIN ANALYZE in a sane state. I haven't looked at the new patch specifically so I don't have an opinion on that at this time. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
This is not a change of the direction, but rather of the approach. Hitting a process with a signal and hoping it will produce a meaningful response in all circumstances without disrupting its current task was way too naive. I'd really want to make this work with ANALYZE, just not as the first step. I believe this approach can be extended to include instrumentation support (obviously we will not be able to contain this in the auto_explain module).
"Shulgin, Oleksandr" <oleksandr.shulgin@zalando.de> writes: > I was thinking about this and what seems to be the biggest problem is when > to actually turn the feature on. It seems unlikely that someone will want > to enable it unconditionally. Enabling per-backend also doesn't seem to be > a good approach because you don't know if the next query you'd like to look > at is going to run in this exact backend. Check. > What might be actually usable is poking pg_stat_statements for queryid to > decide if we need to do explain (and possibly analyze). Hm, interesting thought. > Does this make sense to you? Does this make a good argument for merging > pg_stat_statements and auto_explain into core? I'd say more that it's a good argument for moving this feature out to one of those extensions, or perhaps building a third extension that depends on both of those. TBH, none of this stuff smells to me like something that ought to be in core. regards, tom lane
"Shulgin, Oleksandr" <oleksandr.shulgin@zalando.de> writes:
> I was thinking about this and what seems to be the biggest problem is when
> to actually turn the feature on. It seems unlikely that someone will want
> to enable it unconditionally. Enabling per-backend also doesn't seem to be
> a good approach because you don't know if the next query you'd like to look
> at is going to run in this exact backend.
Check.
> What might be actually usable is poking pg_stat_statements for queryid to
> decide if we need to do explain (and possibly analyze).
Hm, interesting thought.
> Does this make sense to you? Does this make a good argument for merging
> pg_stat_statements and auto_explain into core?
I'd say more that it's a good argument for moving this feature out to
one of those extensions, or perhaps building a third extension that
depends on both of those. TBH, none of this stuff smells to me like
something that ought to be in core.
regards, tom lane
Hello, On 15/10/2015 16:04, Pavel Stehule wrote: > > > 2015-10-15 15:42 GMT+02:00 Tom Lane <tgl@sss.pgh.pa.us > <mailto:tgl@sss.pgh.pa.us>>: > > "Shulgin, Oleksandr" <oleksandr.shulgin@zalando.de > <mailto:oleksandr.shulgin@zalando.de>> writes: > > I was thinking about this and what seems to be the biggest problem is when > > to actually turn the feature on. It seems unlikely that someone will want > > to enable it unconditionally. Enabling per-backend also doesn't seem to be > > a good approach because you don't know if the next query you'd like to look > > at is going to run in this exact backend. > > Check. > > > What might be actually usable is poking pg_stat_statements for queryid to > > decide if we need to do explain (and possibly analyze). > > Hm, interesting thought. > > > Does this make sense to you? Does this make a good argument for merging > > pg_stat_statements and auto_explain into core? > > I'd say more that it's a good argument for moving this feature out to > one of those extensions, or perhaps building a third extension that > depends on both of those. TBH, none of this stuff smells to me like > something that ought to be in core. > > > There are few features, that I would to see in core: > > 1. possibility to get full SQL string > 2. possibility to get state string > > We can speak how to do it well. > I registered as reviewer on this, but after reading the whole thread for the second time, it's still not clear to me if the last two submitted patches (0001-Add-auto_explain.publish_plans.patch and 0002-Add-SHM-table-of-contents-to-the-explain-DSM.patch) are still needed reviews, since multiple refactoring ideas and objections have been raised since. Regards. -- Julien Rouhaud http://dalibo.com - http://dalibo.org
I registered as reviewer on this, but after reading the whole thread for
the second time, it's still not clear to me if the last two submitted
patches (0001-Add-auto_explain.publish_plans.patch and
0002-Add-SHM-table-of-contents-to-the-explain-DSM.patch) are still
needed reviews, since multiple refactoring ideas and objections have
been raised since.
On 30 November 2015 at 22:27, Julien Rouhaud <julien.rouhaud@dalibo.com> wrote:I registered as reviewer on this, but after reading the whole thread for
the second time, it's still not clear to me if the last two submitted
patches (0001-Add-auto_explain.publish_plans.patch and
0002-Add-SHM-table-of-contents-to-the-explain-DSM.patch) are still
needed reviews, since multiple refactoring ideas and objections have
been raised since.I looked into it more deeply and decided partial EXPLAINs aren't very useful yet.I've got some thoughts around that which I'll publish when I have more time. I suggest this is a good idea, just needs some serious committer-love, so we should bounce this for now.
Hi, On 12/01/2015 10:34 AM, Shulgin, Oleksandr wrote: > > I have the plans to make something from this on top of > pg_stat_statements and auto_explain, as I've mentioned last time. The > next iteration will be based on the two latest patches above, so it > still makes sense to review them. > > As for EXPLAIN ANALYZE support, it will require changes to core, but > this can be done separately. I'm re-reading the thread, and I have to admit I'm utterly confused what is the current plan, what features it's supposed to provide and whether it will solve the use case I'm most interested in. Oleksandr, could you please post a summary explaining that? My use case for this functionality is debugging of long-running queries, particularly getting EXPLAIN ANALYZE for them. In such cases I either can't run the EXPLAIN ANALYZE manually because it will either run forever (just like the query) and may not be the same (e.g. due to recent ANALYZE). So we need to extract the data from the process executing the query. I'm not essentially opposed to doing this in an extension (better an extension than nothing), but I don't quite see how you could to do that from pg_stat_statements or auto_explain. AFAIK both extensions merely use hooks before/after the executor, and therefore can't do anything in between (while the query is actually running). Perhaps you don't intend to solve this particular use case? Which use cases are you aiming to solve, then? Could you explain? Maybe all we need to do is add another hook somewhere in the executor, and push the explain analyze into pg_stat_statements once in a while, entirely eliminating the need for inter-process communication (signals, queues, ...). I'm pretty sure we don't need this for "short" queries, because in those cases we have other means to get the explain analyze (e.g. running the query again or auto_explain). So I can imagine having a rather high threshold (say, 60 seconds or even more), and we'd only push the explain analyze after crossing it. And then we'd only update it once in a while - say, again every 60 seconds. Of course, this might be configurable by two GUCs: pg_stat_statements.explain_analyze_threshold = 60 # -1 is "off" pg_stat_statements.explain_analyze_refresh = 60 FWIW I'd still prefer having "EXPLAIN ANALYZE" in core, but better this than nothing. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
On 12/01/2015 10:34 AM, Shulgin, Oleksandr wrote:
I have the plans to make something from this on top of
pg_stat_statements and auto_explain, as I've mentioned last time. The
next iteration will be based on the two latest patches above, so it
still makes sense to review them.
As for EXPLAIN ANALYZE support, it will require changes to core, but
this can be done separately.
I'm re-reading the thread, and I have to admit I'm utterly confused what is the current plan, what features it's supposed to provide and whether it will solve the use case I'm most interested in. Oleksandr, could you please post a summary explaining that?
My use case for this functionality is debugging of long-running queries, particularly getting EXPLAIN ANALYZE for them. In such cases I either can't run the EXPLAIN ANALYZE manually because it will either run forever (just like the query) and may not be the same (e.g. due to recent ANALYZE). So we need to extract the data from the process executing the query.
I'm not essentially opposed to doing this in an extension (better an extension than nothing), but I don't quite see how you could to do that from pg_stat_statements or auto_explain. AFAIK both extensions merely use hooks before/after the executor, and therefore can't do anything in between (while the query is actually running).
Perhaps you don't intend to solve this particular use case? Which use cases are you aiming to solve, then? Could you explain?
Maybe all we need to do is add another hook somewhere in the executor, and push the explain analyze into pg_stat_statements once in a while, entirely eliminating the need for inter-process communication (signals, queues, ...).
I'm pretty sure we don't need this for "short" queries, because in those cases we have other means to get the explain analyze (e.g. running the query again or auto_explain). So I can imagine having a rather high threshold (say, 60 seconds or even more), and we'd only push the explain analyze after crossing it. And then we'd only update it once in a while - say, again every 60 seconds.
Of course, this might be configurable by two GUCs:
pg_stat_statements.explain_analyze_threshold = 60 # -1 is "off"
pg_stat_statements.explain_analyze_refresh = 60
FWIW I'd still prefer having "EXPLAIN ANALYZE" in core, but better this than nothing.
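A sketch of what those two knobs and the periodic check driving them might look like. The option names follow the mail above; the timing variables, the hook placement and publish_current_plan() are invented for illustration.

#include "postgres.h"
#include <limits.h>
#include "portability/instr_time.h"
#include "utils/guc.h"

static int	explain_analyze_threshold = 60;		/* seconds, -1 disables */
static int	explain_analyze_refresh = 60;		/* seconds between updates */

static instr_time query_start;		/* set when execution begins */
static instr_time last_publish;		/* zeroed when execution begins */

/* hypothetical: render the partial EXPLAIN ANALYZE and store it somewhere */
extern void publish_current_plan(void);

/* called from some executor hook while the query is running */
static void
maybe_publish_progress(void)
{
	instr_time	now;
	double		elapsed,
				since_last;

	if (explain_analyze_threshold < 0)
		return;

	INSTR_TIME_SET_CURRENT(now);
	elapsed = INSTR_TIME_GET_DOUBLE(now) - INSTR_TIME_GET_DOUBLE(query_start);
	since_last = INSTR_TIME_GET_DOUBLE(now) - INSTR_TIME_GET_DOUBLE(last_publish);

	if (elapsed >= explain_analyze_threshold &&
		(INSTR_TIME_IS_ZERO(last_publish) || since_last >= explain_analyze_refresh))
	{
		publish_current_plan();
		last_publish = now;
	}
}

void
_PG_init(void)
{
	DefineCustomIntVariable("pg_stat_statements.explain_analyze_threshold",
							"Publish EXPLAIN ANALYZE for queries running longer than this many seconds (-1 disables).",
							NULL, &explain_analyze_threshold,
							60, -1, INT_MAX,
							PGC_SUSET, 0, NULL, NULL, NULL);
	DefineCustomIntVariable("pg_stat_statements.explain_analyze_refresh",
							"Refresh the published EXPLAIN ANALYZE every this many seconds.",
							NULL, &explain_analyze_refresh,
							60, 1, INT_MAX,
							PGC_SUSET, 0, NULL, NULL, NULL);
}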
On Thu, Dec 17, 2015 at 6:45 PM, Shulgin, Oleksandr <oleksandr.shulgin@zalando.de> wrote: > On Wed, Dec 16, 2015 at 8:39 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> > wrote: >> On 12/01/2015 10:34 AM, Shulgin, Oleksandr wrote: >>> >>> >>> I have the plans to make something from this on top of >>> pg_stat_statements and auto_explain, as I've mentioned last time. The >>> next iteration will be based on the two latest patches above, so it >>> still makes sense to review them. >>> >>> As for EXPLAIN ANALYZE support, it will require changes to core, but >>> this can be done separately. >> >> I'm re-reading the thread, and I have to admit I'm utterly confused what >> is the current plan, what features it's supposed to provide and whether it >> will solve the use case I'm most interested in. Oleksandr, could you please >> post a summary explaining that? >> >> My use case for this functionality is debugging of long-running queries, >> particularly getting EXPLAIN ANALYZE for them. In such cases I either can't >> run the EXPLAIN ANALYZE manually because it will either run forever (just >> like the query) and may not be the same (e.g. due to recent ANALYZE). So we >> need to extract the data from the process executing the query. >> >> I'm not essentially opposed to doing this in an extension (better an >> extension than nothing), but I don't quite see how you could to do that from >> pg_stat_statements or auto_explain. AFAIK both extensions merely use hooks >> before/after the executor, and therefore can't do anything in between (while >> the query is actually running). >> >> Perhaps you don't intend to solve this particular use case? Which use >> cases are you aiming to solve, then? Could you explain? > > Thanks for your interest in this patch! > > My motivation is the same as your use case: having a long-running query, be > able to look inside this exact query run by this exact backend. > > I admit the evolution of ideas in this patch can be very confusing: we were > trying a number of different approaches, w/o thinking deeply on the > implications, just to have a proof of concept. It's great to see ideas of concepts and things to help address those issues, at least we are agreeing that there is a hole in the instrumentation and that it would be nice to fill it with something. Still, it is not completely clear which approach would be taken. Is it fair to mark the current patch as returned with feedback then? -- Michael
On Thu, Dec 17, 2015 at 6:45 PM, Shulgin, Oleksandr
<oleksandr.shulgin@zalando.de> wrote:
> On Wed, Dec 16, 2015 at 8:39 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com>
> wrote:> Thanks for your interest in this patch!>> On 12/01/2015 10:34 AM, Shulgin, Oleksandr wrote:
>>>
>>>
>>> I have the plans to make something from this on top of
>>> pg_stat_statements and auto_explain, as I've mentioned last time. The
>>> next iteration will be based on the two latest patches above, so it
>>> still makes sense to review them.
>>>
>>> As for EXPLAIN ANALYZE support, it will require changes to core, but
>>> this can be done separately.
>>
>> I'm re-reading the thread, and I have to admit I'm utterly confused what
>> is the current plan, what features it's supposed to provide and whether it
>> will solve the use case I'm most interested in. Oleksandr, could you please
>> post a summary explaining that?
>>
>> My use case for this functionality is debugging of long-running queries,
>> particularly getting EXPLAIN ANALYZE for them. In such cases I either can't
>> run the EXPLAIN ANALYZE manually because it will either run forever (just
>> like the query) and may not be the same (e.g. due to recent ANALYZE). So we
>> need to extract the data from the process executing the query.
>>
>> I'm not essentially opposed to doing this in an extension (better an
>> extension than nothing), but I don't quite see how you could to do that from
>> pg_stat_statements or auto_explain. AFAIK both extensions merely use hooks
>> before/after the executor, and therefore can't do anything in between (while
>> the query is actually running).
>>
>> Perhaps you don't intend to solve this particular use case? Which use
>> cases are you aiming to solve, then? Could you explain?
>
>
> My motivation is the same as your use case: having a long-running query, be
> able to look inside this exact query run by this exact backend.
>
> I admit the evolution of ideas in this patch can be very confusing: we were
> trying a number of different approaches, w/o thinking deeply on the
> implications, just to have a proof of concept.
It's great to see ideas of concepts and things to help address those
issues, at least we are agreeing that there is a hole in the
instrumentation and that it would be nice to fill it with something.
Still, it is not completely clear which approach would be taken. Is it
fair to mark the current patch as returned with feedback then?
+1
--
Michael
On Thu, Dec 24, 2015 at 1:57 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote: > > > 2015-12-24 3:23 GMT+01:00 Michael Paquier <michael.paquier@gmail.com>: >> >> On Thu, Dec 17, 2015 at 6:45 PM, Shulgin, Oleksandr >> <oleksandr.shulgin@zalando.de> wrote: >> > On Wed, Dec 16, 2015 at 8:39 PM, Tomas Vondra >> > <tomas.vondra@2ndquadrant.com> >> > wrote: >> >> On 12/01/2015 10:34 AM, Shulgin, Oleksandr wrote: >> >>> >> >>> >> >>> I have the plans to make something from this on top of >> >>> pg_stat_statements and auto_explain, as I've mentioned last time. The >> >>> next iteration will be based on the two latest patches above, so it >> >>> still makes sense to review them. >> >>> >> >>> As for EXPLAIN ANALYZE support, it will require changes to core, but >> >>> this can be done separately. >> >> >> >> I'm re-reading the thread, and I have to admit I'm utterly confused >> >> what >> >> is the current plan, what features it's supposed to provide and whether >> >> it >> >> will solve the use case I'm most interested in. Oleksandr, could you >> >> please >> >> post a summary explaining that? >> >> >> >> My use case for this functionality is debugging of long-running >> >> queries, >> >> particularly getting EXPLAIN ANALYZE for them. In such cases I either >> >> can't >> >> run the EXPLAIN ANALYZE manually because it will either run forever >> >> (just >> >> like the query) and may not be the same (e.g. due to recent ANALYZE). >> >> So we >> >> need to extract the data from the process executing the query. >> >> >> >> I'm not essentially opposed to doing this in an extension (better an >> >> extension than nothing), but I don't quite see how you could to do that >> >> from >> >> pg_stat_statements or auto_explain. AFAIK both extensions merely use >> >> hooks >> >> before/after the executor, and therefore can't do anything in between >> >> (while >> >> the query is actually running). >> >> >> >> Perhaps you don't intend to solve this particular use case? Which use >> >> cases are you aiming to solve, then? Could you explain? >> > >> > Thanks for your interest in this patch! >> > >> > My motivation is the same as your use case: having a long-running query, >> > be >> > able to look inside this exact query run by this exact backend. >> > >> > I admit the evolution of ideas in this patch can be very confusing: we >> > were >> > trying a number of different approaches, w/o thinking deeply on the >> > implications, just to have a proof of concept. >> >> It's great to see ideas of concepts and things to help address those >> issues, at least we are agreeing that there is a hole in the >> instrumentation and that it would be nice to fill it with something. >> Still, it is not completely clear which approach would be taken. Is it >> fair to mark the current patch as returned with feedback then? > > > +1 So, done this way. We could always move it to next CF if someone wishes to move on. But at this point that would be a different approach than what was proposed first.. -- Michael