Re: Why is pq_begintypsend so slow? - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: Why is pq_begintypsend so slow? |
Date | |
Msg-id | 20200114224520.gh4qktr5kuhu5pyi@alap3.anarazel.de Whole thread Raw |
In response to | Re: Why is pq_begintypsend so slow? (Andres Freund <andres@anarazel.de>) |
Responses |
Re: Why is pq_begintypsend so slow?
|
List | pgsql-hackers |
Hi, On 2020-01-13 15:18:04 -0800, Andres Freund wrote: > On 2020-01-11 22:32:45 -0500, Tom Lane wrote: > > I wrote: > > > I saw at this point that the remaining top spots were > > > enlargeStringInfo and appendBinaryStringInfo, so I experimented > > > with inlining them (again, see patch below). That *did* move > > > the needle: down to 72691 ms, or 20% better than HEAD. > > > > Oh ... marking the test in the inline part of enlargeStringInfo() > > as unlikely() helps quite a bit more: 66100 ms, a further 9% gain. > > Might be over-optimizing for this particular case, perhaps > > > > (But ... I'm not finding these numbers to be super reproducible > > across different ASLR layouts. So take it with a grain of salt.) > > FWIW, I've also observed, in another thread (the node func generation > thing [1]), that inlining enlargeStringInfo() helps a lot, especially > when inlining some of its callers. Moving e.g. appendStringInfo() inline > allows the compiler to sometimes optimize away the strlen. But if > e.g. an inlined appendBinaryStringInfo() still calls enlargeStringInfo() > unconditionally, successive appends cannot optimize away memory accesses > for ->len/->data. With a set of patches doing so, int4send itself is not a significant factor for my test benchmark [1] anymore. The assembly looks about as good as one could hope, I think: # save rbx on the stack 0x00000000004b7f90 <+0>: push %rbx 0x00000000004b7f91 <+1>: sub $0x20,%rsp # store integer to be sent into rbx 0x00000000004b7f95 <+5>: mov 0x20(%rdi),%rbx # palloc length argument 0x00000000004b7f99 <+9>: mov $0x9,%edi 0x00000000004b7f9e <+14>: callq 0x5d9aa0 <palloc> # store integer in buffer (ebx is 4 byte portion of rbx) 0x00000000004b7fa3 <+19>: movbe %ebx,0x4(%rax) # store varlena header 0x00000000004b7fa8 <+24>: movl $0x20,(%rax) # restore stack and rbx registerz 0x00000000004b7fae <+30>: add $0x20,%rsp 0x00000000004b7fb2 <+34>: pop %rbx 0x00000000004b7fb3 <+35>: retq All the $0x20 constants are a bit confusing, but they just happen to be the same for int4send. It's the size of the stack frame, offset for FunctionCallInfoBaseData->args[0], the varlena header (and then the stack frame again) respectively. Note that I had to annotate palloc with __attribute__((malloc)) to make the compiler understand that palloc's returned value will not alias with anything problematic (e.g. the potential of aliasing with fcinfo prevents optimizing to the above without that annotation). I think such annotations would be a good idea anyway, precisely because they allow the compiler to optimize code significantly better. These together yields about a 1.8x speedup for me. The profile shows that the overhead now is overwhelmingly elsewhere: + 26.30% postgres postgres [.] CopyOneRowTo + 13.40% postgres postgres [.] tts_buffer_heap_getsomeattrs + 10.61% postgres postgres [.] AllocSetAlloc + 9.26% postgres libc-2.29.so [.] __memmove_avx_unaligned_erms + 7.32% postgres postgres [.] SendFunctionCall + 6.02% postgres postgres [.] palloc + 4.45% postgres postgres [.] int4send + 3.68% postgres libc-2.29.so [.] _IO_fwrite + 2.71% postgres postgres [.] heapgettup_pagemode + 1.96% postgres postgres [.] AllocSetReset + 1.83% postgres postgres [.] CopySendEndOfRow + 1.75% postgres libc-2.29.so [.] _IO_file_xsputn@@GLIBC_2.2.5 + 1.60% postgres postgres [.] ExecStoreBufferHeapTuple + 1.57% postgres postgres [.] DoCopyTo + 1.16% postgres postgres [.] memcpy@plt + 1.07% postgres postgres [.] heapgetpage Even without using the new pq_begintypesend_ex()/initStringInfoEx(), the generated code is still considerably better than before, yielding a 1.58x speedup. Tallocator overhead unsurprisingly is higher: + 24.93% postgres postgres [.] CopyOneRowTo + 17.10% postgres postgres [.] AllocSetAlloc + 10.09% postgres postgres [.] tts_buffer_heap_getsomeattrs + 6.50% postgres libc-2.29.so [.] __memmove_avx_unaligned_erms + 5.99% postgres postgres [.] SendFunctionCall + 5.11% postgres postgres [.] palloc + 3.95% postgres libc-2.29.so [.] _int_malloc + 3.38% postgres postgres [.] int4send + 2.54% postgres postgres [.] heapgettup_pagemode + 2.11% postgres libc-2.29.so [.] _int_free + 2.06% postgres postgres [.] MemoryContextReset + 2.02% postgres postgres [.] AllocSetReset + 1.97% postgres libc-2.29.so [.] _IO_fwrite + 1.47% postgres postgres [.] DoCopyTo + 1.14% postgres postgres [.] ExecStoreBufferHeapTuple + 1.06% postgres libc-2.29.so [.] _IO_file_xsputn@@GLIBC_2.2.5 + 1.04% postgres libc-2.29.so [.] malloc Adding a few pg_restrict*, and using appendBinaryStringInfoNT instead of appendBinaryStringInfo in CopySend* gains another 1.05x. This does result in some code growth, but given the size of the improvements, and that the improvements are significant even without code changes to callsites, that seems worth it. before: text data bss dec hex filename 8482739 172304 204240 8859283 872e93 src/backend/postgres after: text data bss dec hex filename 8604300 172304 204240 8980844 89096c src/backend/postgres Regards, Andres [1] CREATE TABLE lotsaints4(c01 int4 NOT NULL, c02 int4 NOT NULL, c03 int4 NOT NULL, c04 int4 NOT NULL, c05 int4 NOT NULL, c06int4 NOT NULL, c07 int4 NOT NULL, c08 int4 NOT NULL, c09 int4 NOT NULL, c10 int4 NOT NULL); INSERT INTO lotsaints4 SELECT ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random()* 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random()* 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int FROM generate_series(1, 2000000); VACUUM FREEZE lotsaints4; COPY lotsaints4 TO '/dev/null' WITH binary; CREATE TABLE lotsaints8(c01 int8 NOT NULL, c02 int8 NOT NULL, c03 int8 NOT NULL, c04 int8 NOT NULL, c05 int8 NOT NULL, c06int8 NOT NULL, c07 int8 NOT NULL, c08 int8 NOT NULL, c09 int8 NOT NULL, c10 int8 NOT NULL); INSERT INTO lotsaints8 SELECT ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random()* 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random()* 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int FROM generate_series(1, 2000000); VACUUM FREEZE lotsaints8; COPY lotsaints8 TO '/dev/null' WITH binary;
Attachment
- v1-0001-stringinfo-Move-more-functions-inline-provide-ini.patch
- v1-0002-stringinfo-Remove-in-core-use-of-appendStringInfo.patch
- v1-0003-pqformat-Move-functions-for-type-sending-inline-a.patch
- v1-0004-WIP-Annotate-palloc-with-malloc-and-other-compile.patch
- v1-0005-Move-int4-int8-send-to-use-pq_begintypsend_ex.patch
- v1-0006-copy-Use-appendBinaryStringInfoNT-for-sending-bin.patch
pgsql-hackers by date: