Thread: Making wait events a bit more efficient

Making wait events a bit more efficient

From

Andres Freund

Date:

02 April 2021, 19:44:58

Hi,

This grew out of my patch to split the waits event code out of
pgstat.[ch], which in turn grew out of the shared memory stats patch
series.


pgstat_report_wait_start() and pgstat_report_wait_end() currently check
pgstat_track_activities before assigning to MyProc->wait_event_info.
Given the small cost of the assignment, and that pgstat_track_activities
is almost always used, I'm doubtful that that's the right tradeoff.

Normally I would say that branch prediction will take care of this cost
- but because pgstat_report_wait_{start,end} are inlined, that has to
happen in each of the calling locations.

The code works out to be something like the following (this is from
basebackup_read_file, the simplest caller I could quickly find, I
removed interspersed code from it):

267        if (!pgstat_track_activities || !proc)
   0x0000000000430e4d <+13>:    cmpb   $0x1,0x4882e1(%rip)        # 0x8b9135 <pgstat_track_activities>

265        volatile PGPROC *proc = MyProc;
   0x0000000000430e54 <+20>:    mov    0x48c52d(%rip),%rax        # 0x8bd388 <MyProc>

266
267        if (!pgstat_track_activities || !proc)
   0x0000000000430e5b <+27>:    jne    0x430e6c <basebackup_read_file+44>
   0x0000000000430e5d <+29>:    test   %rax,%rax
   0x0000000000430e60 <+32>:    je     0x430e6c <basebackup_read_file+44>

268            return;
269
270        /*
271         * Since this is a four-byte field which is always read and written as
272         * four-bytes, updates are atomic.
273         */
274        proc->wait_event_info = wait_event_info;
   0x0000000000430e62 <+34>:    movl   $0xa000000,0x2c8(%rax)

/home/andres/src/postgresql/src/backend/replication/basebackup.c:
2014        rc = pg_pread(fd, buf, nbytes, offset);
   0x0000000000430e6c <+44>:    call   0xc4790 <pread@plt>

stripping the source:
   0x0000000000430e4d <+13>:    cmpb   $0x1,0x4882e1(%rip)        # 0x8b9135 <pgstat_track_activities>
   0x0000000000430e54 <+20>:    mov    0x48c52d(%rip),%rax        # 0x8bd388 <MyProc>
   0x0000000000430e5b <+27>:    jne    0x430e6c <basebackup_read_file+44>
   0x0000000000430e5d <+29>:    test   %rax,%rax
   0x0000000000430e60 <+32>:    je     0x430e6c <basebackup_read_file+44>
   0x0000000000430e62 <+34>:    movl   $0xa000000,0x2c8(%rax)
   0x0000000000430e6c <+44>:    call   0xc4790 <pread@plt>


just removing the pgstat_track_activities check turns that into

   0x0000000000430d8d <+13>:    mov    0x48c5f4(%rip),%rax        # 0x8bd388 <MyProc>
   0x0000000000430d94 <+20>:    test   %rax,%rax
   0x0000000000430d97 <+23>:    je     0x430da3 <basebackup_read_file+35>
   0x0000000000430d99 <+25>:    movl   $0xa000000,0x2c8(%rax)
   0x0000000000430da3 <+35>:    call   0xc4790 <pread@plt>

which does seem (a bit) nicer.

However, we can improve this further, entirely eliminating branches, by
introducing something like "my_wait_event_info" that initially just
points to a local variable and is switched to shared once MyProc is
assigned.

Obviously incorrect, for comparison: Just removing the MyProc != NULL
check yields:
   0x0000000000430bcd <+13>:    mov    0x48c7b4(%rip),%rax        # 0x8bd388 <MyProc>
   0x0000000000430bd4 <+20>:    movl   $0xa000000,0x2c8(%rax)
   0x0000000000430bde <+30>:    call   0xc47d0 <pread@plt>

using a uint32 *my_wait_event_info yields:
   0x0000000000430b4d <+13>:    mov    0x47615c(%rip),%rax        # 0x8a6cb0 <my_wait_event_info>
   0x0000000000430b54 <+20>:    movl   $0xa000000,(%rax)
   0x0000000000430b5a <+26>:    call   0xc47d0 <pread@plt>

Note how the lack of offset addressing in the my_wait_event_info version
makes the instruction smaller (call is at 26 instead of 30).


Now, perhaps all of this isn't worth optimizing, most of the things done
within pgstat_report_wait_start()/end() are expensive-ish. And forward
branches are statically predicted to be not taken on several
platforms. I have seen this these instructions show up in profiles in
workloads with contended lwlocks at least...

There's also a small win in code size:
   text       data        bss        dec        hex    filename
8932095     192160     204656    9328911     8e590f    src/backend/postgres
8928544     192160     204656    9325360     8e4b30    src/backend/postgres_my_wait_event_info


If we went for the my_wait_event_info approach there is one further
advantage, after my change to move the wait event code into a separate
file: wait_event.h does not need to include proc.h anymore, which seems
architecturally nice for things like fd.c.


Attached is patch series doing this.


I'm inclined to separately change the comment format for
wait_event.[ch], there's no no reason to stick with the current style:

/* ----------
 * pgstat_report_wait_start() -
 *
 *    Called from places where server process needs to wait.  This is called
 *    to report wait event information.  The wait information is stored
 *    as 4-bytes where first byte represents the wait event class (type of
 *    wait, for different types of wait, refer WaitClass) and the next
 *    3-bytes represent the actual wait event.  Currently 2-bytes are used
 *    for wait event which is sufficient for current usage, 1-byte is
 *    reserved for future usage.
 *
 * NB: this *must* be able to survive being called before MyProc has been
 * initialized.
 * ----------
 */

I.e. I'd like to remove the ----- framing, the repetition of the
function name, and the varying indentation in the comment.

Greetings,

Andres Freund

Attachment

Re: Making wait events a bit more efficient

From

Zhihong Yu

Date:

02 April 2021, 20:06:35

Hi,

+extern PGDLLIMPORT uint32 *my_wait_event_info;

It seems volatile should be added to the above declaration. Since later:

+ *(volatile uint32 *) my_wait_event_info = wait_event_info;

Cheers

On Fri, Apr 2, 2021 at 12:45 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

This grew out of my patch to split the waits event code out of
pgstat.[ch], which in turn grew out of the shared memory stats patch
series.

pgstat_report_wait_start() and pgstat_report_wait_end() currently check
pgstat_track_activities before assigning to MyProc->wait_event_info.
Given the small cost of the assignment, and that pgstat_track_activities
is almost always used, I'm doubtful that that's the right tradeoff.

Normally I would say that branch prediction will take care of this cost
- but because pgstat_report_wait_{start,end} are inlined, that has to
happen in each of the calling locations.

The code works out to be something like the following (this is from
basebackup_read_file, the simplest caller I could quickly find, I
removed interspersed code from it):

267 if (!pgstat_track_activities || !proc)
0x0000000000430e4d <+13>: cmpb $0x1,0x4882e1(%rip) # 0x8b9135 <pgstat_track_activities>

265 volatile PGPROC *proc = MyProc;
0x0000000000430e54 <+20>: mov 0x48c52d(%rip),%rax # 0x8bd388 <MyProc>

266
267 if (!pgstat_track_activities || !proc)
0x0000000000430e5b <+27>: jne 0x430e6c <basebackup_read_file+44>
0x0000000000430e5d <+29>: test %rax,%rax
0x0000000000430e60 <+32>: je 0x430e6c <basebackup_read_file+44>

268 return;
269
270 /*
271 * Since this is a four-byte field which is always read and written as
272 * four-bytes, updates are atomic.
273 */
274 proc->wait_event_info = wait_event_info;
0x0000000000430e62 <+34>: movl $0xa000000,0x2c8(%rax)

/home/andres/src/postgresql/src/backend/replication/basebackup.c:
2014 rc = pg_pread(fd, buf, nbytes, offset);
0x0000000000430e6c <+44>: call 0xc4790 <pread@plt>

stripping the source:
0x0000000000430e4d <+13>: cmpb $0x1,0x4882e1(%rip) # 0x8b9135 <pgstat_track_activities>
0x0000000000430e54 <+20>: mov 0x48c52d(%rip),%rax # 0x8bd388 <MyProc>
0x0000000000430e5b <+27>: jne 0x430e6c <basebackup_read_file+44>
0x0000000000430e5d <+29>: test %rax,%rax
0x0000000000430e60 <+32>: je 0x430e6c <basebackup_read_file+44>
0x0000000000430e62 <+34>: movl $0xa000000,0x2c8(%rax)
0x0000000000430e6c <+44>: call 0xc4790 <pread@plt>

just removing the pgstat_track_activities check turns that into

0x0000000000430d8d <+13>: mov 0x48c5f4(%rip),%rax # 0x8bd388 <MyProc>
0x0000000000430d94 <+20>: test %rax,%rax
0x0000000000430d97 <+23>: je 0x430da3 <basebackup_read_file+35>
0x0000000000430d99 <+25>: movl $0xa000000,0x2c8(%rax)
0x0000000000430da3 <+35>: call 0xc4790 <pread@plt>

which does seem (a bit) nicer.

However, we can improve this further, entirely eliminating branches, by
introducing something like "my_wait_event_info" that initially just
points to a local variable and is switched to shared once MyProc is
assigned.

Obviously incorrect, for comparison: Just removing the MyProc != NULL
check yields:
0x0000000000430bcd <+13>: mov 0x48c7b4(%rip),%rax # 0x8bd388 <MyProc>
0x0000000000430bd4 <+20>: movl $0xa000000,0x2c8(%rax)
0x0000000000430bde <+30>: call 0xc47d0 <pread@plt>

using a uint32 *my_wait_event_info yields:
0x0000000000430b4d <+13>: mov 0x47615c(%rip),%rax # 0x8a6cb0 <my_wait_event_info>
0x0000000000430b54 <+20>: movl $0xa000000,(%rax)
0x0000000000430b5a <+26>: call 0xc47d0 <pread@plt>

Note how the lack of offset addressing in the my_wait_event_info version
makes the instruction smaller (call is at 26 instead of 30).

Now, perhaps all of this isn't worth optimizing, most of the things done
within pgstat_report_wait_start()/end() are expensive-ish. And forward
branches are statically predicted to be not taken on several
platforms. I have seen this these instructions show up in profiles in
workloads with contended lwlocks at least...

There's also a small win in code size:
text data bss dec hex filename
8932095 192160 204656 9328911 8e590f src/backend/postgres
8928544 192160 204656 9325360 8e4b30 src/backend/postgres_my_wait_event_info

If we went for the my_wait_event_info approach there is one further
advantage, after my change to move the wait event code into a separate
file: wait_event.h does not need to include proc.h anymore, which seems
architecturally nice for things like fd.c.

Attached is patch series doing this.

I'm inclined to separately change the comment format for
wait_event.[ch], there's no no reason to stick with the current style:

/* ----------
* pgstat_report_wait_start() -
*
* Called from places where server process needs to wait. This is called
* to report wait event information. The wait information is stored
* as 4-bytes where first byte represents the wait event class (type of
* wait, for different types of wait, refer WaitClass) and the next
* 3-bytes represent the actual wait event. Currently 2-bytes are used
* for wait event which is sufficient for current usage, 1-byte is
* reserved for future usage.
*
* NB: this *must* be able to survive being called before MyProc has been
* initialized.
* ----------
*/

I.e. I'd like to remove the ----- framing, the repetition of the
function name, and the varying indentation in the comment.

Greetings,

Andres Freund

Re: Making wait events a bit more efficient

From

Andres Freund

Date:

02 April 2021, 20:10:11

Hi,

On 2021-04-02 13:06:35 -0700, Zhihong Yu wrote:
> +extern PGDLLIMPORT uint32 *my_wait_event_info;
> 
> It seems volatile should be added to the above declaration. Since later:
> 
> +   *(volatile uint32 *) my_wait_event_info = wait_event_info;

Why? We really just want to make the store volatile, nothing else. I
think it's much better to annotate that we want individual stores to
happen regardless of compiler optimizations, rather than all
interactions with a variable.

Greetings,

Andres Freund

Re: Making wait events a bit more efficient

From

Zhihong Yu

Date:

02 April 2021, 20:42:42

Hi,

Maybe I am not familiar with your patch.

I don't see where my_wait_event_info is read (there is no getter method in the patch).

In that case, it is fine omitting volatile in the declaration.

Cheers

On Fri, Apr 2, 2021 at 1:10 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2021-04-02 13:06:35 -0700, Zhihong Yu wrote:
> +extern PGDLLIMPORT uint32 *my_wait_event_info;
>
> It seems volatile should be added to the above declaration. Since later:
>
> + *(volatile uint32 *) my_wait_event_info = wait_event_info;

Why? We really just want to make the store volatile, nothing else. I
think it's much better to annotate that we want individual stores to
happen regardless of compiler optimizations, rather than all
interactions with a variable.

Greetings,

Andres Freund

Re: Making wait events a bit more efficient

From

Andres Freund

Date:

02 April 2021, 20:44:22

Hi,

On 2021-04-02 13:42:42 -0700, Zhihong Yu wrote:
> I don't see where my_wait_event_info is read (there is no getter method in
> the patch).

There are no reads via my_wait_event_info. Once connected to shared
memory, InitProcess() calls pgstat_set_wait_event_storage() to point it
to &MyProc->wait_event_info.

Greetings,

Andres Freund

Re: Making wait events a bit more efficient

From

Andres Freund

Date:

03 April 2021, 02:55:16

Hi,

On 2021-04-02 12:44:58 -0700, Andres Freund wrote:
> If we went for the my_wait_event_info approach there is one further
> advantage, after my change to move the wait event code into a separate
> file: wait_event.h does not need to include proc.h anymore, which seems
> architecturally nice for things like fd.c.

That part turns out to make one aspect of the shared memory stats patch
cleaner, so I am planning to push this commit fairly soon, unless
somebody sees a reason not to do so?

Greetings,

Andres Freund

Re: Making wait events a bit more efficient

From

Andres Freund

Date:

03 April 2021, 19:08:36

On 2021-04-02 19:55:16 -0700, Andres Freund wrote:
> On 2021-04-02 12:44:58 -0700, Andres Freund wrote:
> > If we went for the my_wait_event_info approach there is one further
> > advantage, after my change to move the wait event code into a separate
> > file: wait_event.h does not need to include proc.h anymore, which seems
> > architecturally nice for things like fd.c.
> 
> That part turns out to make one aspect of the shared memory stats patch
> cleaner, so I am planning to push this commit fairly soon, unless
> somebody sees a reason not to do so?

Done.