Re: [HACKERS] [POC] Faster processing at Gather node - Mailing list pgsql-hackers
| From | Andres Freund |
|---|---|
| Subject | Re: [HACKERS] [POC] Faster processing at Gather node |
| Date | |
| Msg-id | 20171017213957.hylzfplumptbfvro@alap3.anarazel.de |
| In response to | [HACKERS] [POC] Faster processing at Gather node (Rafia Sabih <rafia.sabih@enterprisedb.com>) |
| Responses | Re: [HACKERS] [POC] Faster processing at Gather node (4 replies) |
| List | pgsql-hackers |
Hi Everyone,
On 2017-05-19 17:25:38 +0530, Rafia Sabih wrote:
> While analysing the performance of TPC-H queries for the newly developed
> parallel-operators, viz, parallel index, bitmap heap scan, etc. we noticed
> that the time taken by gather node is significant. On investigation, as per
> the current method it copies each tuple to the shared queue and notifies
> the receiver. Since, this copying is done in shared queue, a lot of locking
> and latching overhead is there.
>
> So, in this POC patch I tried to copy all the tuples in a local queue thus
> avoiding all the locks and latches. Once, the local queue is filled as per
> it's capacity, tuples are transferred to the shared queue. Once, all the
> tuples are transferred the receiver is sent the notification about the same.
>
> With this patch I could see significant improvement in performance for
> simple queries,
I've spent some time looking into this, and I'm not quite convinced this
is the right approach. Let me start by describing where I see the
current performance problems around gather stemming from.
The observations here are made using
select * from t where i < 30000000 offset 29999999 - 1;
with Rafia's data. That avoids slowdowns on the leader due to too many
rows printed out. Sometimes I'll also use
SELECT * FROM lineitem WHERE l_suppkey > '5012' OFFSET 1000000000 LIMIT 1;
on tpch data to show the effects on wider tables.
The precise query doesn't really matter, the observations here are more
general, I hope.
1) nodeGather.c re-projects every row from workers. As far as I can tell that's now always exactly the same targetlist as it's coming from the worker. Projection capability was added in 8538a6307049590 (without checking whether it's needed, afaict), but I think it was in turn often obsoleted by 992b5ba30dcafdc222341505b072a6b009b248a7. My measurements show that removing the projection yields quite massive speedups in queries like yours, which is not too surprising.
I suspect this just needs a tlist_matches_tupdesc check + an if around ExecProject(). And a test; right now tests pass even with the projection commented out unconditionally, unless force_parallel_mode is used.
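A rough sketch of that check follows. This won't compile standalone and is not the actual patch: tlist_matches_tupdesc() is currently static in execScan.c and would need to be exposed, and the need_projection field is invented here for illustration:

```c
/* Sketch only - decide once, at executor startup, whether Gather's
 * projection is a no-op, instead of projecting every row. */

/* in ExecInitGather(), after the projection info has been built: */
gatherstate->need_projection =
    !tlist_matches_tupdesc(&gatherstate->ps,
                           node->plan.targetlist,
                           OUTER_VAR,
                           ExecGetResultType(outerPlanState(gatherstate)));

/* in gather_getnext(), for each tuple pulled from a worker: */
if (!gatherstate->need_projection)
    return slot;            /* worker tuple already has the right shape */
return ExecProject(gatherstate->ps.ps_ProjInfo);
```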
before commenting out nodeGather projection:
rafia time: 8283.583 rafia profile:
+ 30.62% postgres postgres [.] shm_mq_receive
+ 18.49% postgres postgres [.] s_lock
+ 10.08% postgres postgres [.] SetLatch
- 7.02% postgres postgres [.] slot_deform_tuple
   - slot_deform_tuple
      - 88.01% slot_getsomeattrs
           ExecInterpExpr
           ExecGather
           ExecLimit
lineitem time: 8448.468 lineitem profile:
+ 24.63% postgres postgres [.] slot_deform_tuple
+ 24.43% postgres postgres [.] shm_mq_receive
+ 17.36% postgres postgres [.] ExecInterpExpr
+ 7.41% postgres postgres [.] s_lock
+ 5.73% postgres postgres [.] SetLatch
after commenting out nodeGather projection:
rafia time: 6660.224 rafia profile:
+ 36.77% postgres postgres [.] shm_mq_receive
+ 19.33% postgres postgres [.] s_lock
+ 13.14% postgres postgres [.] SetLatch
+ 9.22% postgres postgres [.] AllocSetReset
+ 4.27% postgres postgres [.] ExecGather
+ 2.79% postgres postgres [.] AllocSetAlloc
lineitem time: 4507.416 lineitem profile:
+ 34.81% postgres postgres [.] shm_mq_receive
+ 15.45% postgres postgres [.] s_lock
+ 13.38% postgres postgres [.] SetLatch
+ 9.87% postgres postgres [.] AllocSetReset
+ 5.82% postgres postgres [.] ExecGather
As is quite clearly visible, avoiding the projection yields some major speedups.
The following analysis has the projection removed.
2) The spinlocks on both the sending and the receiving side are quite hot:
rafia query leader:
+ 36.16% postgres postgres [.] shm_mq_receive
+ 19.49% postgres postgres [.] s_lock
+ 13.24% postgres postgres [.] SetLatch
The presence of s_lock shows us that we're clearly often contending on spinlocks, given that's the slow path for SpinLockAcquire(). In shm_mq_receive the instruction profile shows:
       │      SpinLockAcquire(&mq->mq_mutex);
       │1 5ab:  mov    $0xa9b580,%ecx
       │        mov    $0x4a4,%edx
       │        mov    $0xa9b538,%esi
       │        mov    %r15,%rdi
       │      → callq  s_lock
       │      ↑ jmpq   2a1
       │      tas():
       │1 5c7:  mov    $0x1,%eax
 32.83 │        lock   xchg %al,(%r15)
       │      shm_mq_inc_bytes_read():
       │      SpinLockAcquire(&mq->mq_mutex);

and

  0.01 │        pop    %r15
  0.04 │      ← retq
       │        nop
       │      tas():
       │1 338:  mov    $0x1,%eax
 17.59 │        lock   xchg %al,(%r15)
       │      shm_mq_get_bytes_written():
       │      SpinLockAcquire(&mq->mq_mutex);
  0.05 │        test   %al,%al
  0.01 │      ↓ jne    448
       │      v = mq->mq_bytes_written;
rafia query worker:
+ 33.00% postgres postgres [.] shm_mq_send_bytes
+ 9.90% postgres postgres [.] s_lock
+ 7.74% postgres postgres [.] shm_mq_send
+ 5.40% postgres postgres [.] ExecInterpExpr
+ 5.34% postgres postgres [.] SetLatch
Again, we have strong indicators for a lot of spinlock contention. The instruction profiles show the same.

shm_mq_send_bytes:

       │      shm_mq_inc_bytes_written(mq, MAXALIGN(sendnow));
       │        and    $0xfffffffffffffff8,%r15
       │      tas():
  0.08 │        mov    %ebp,%eax
 31.07 │        lock   xchg %al,(%r14)
       │      shm_mq_inc_bytes_written():
       │       * Increment the number of bytes written.
       │       */

and

       │3 98:   cmp    %r13,%rbx
  0.70 │      ↓ jae    430
       │      tas():
  0.12 │1 a1:   mov    %ebp,%eax
 28.53 │        lock   xchg %al,(%r14)
       │      shm_mq_get_bytes_read():
       │      SpinLockAcquire(&mq->mq_mutex);
       │        test   %al,%al
       │      ↓ jne    298
       │      v = mq->mq_bytes_read;

shm_mq_send:

       │      tas():
 50.73 │        lock   xchg %al,0x0(%r13)
       │      shm_mq_notify_receiver():
       │      shm_mq_notify_receiver(volatile shm_mq *mq)
       │      {
       │        PGPROC *receiver;
       │        bool    detached;
My interpretation here is that it's not just the effect of the locking causing the slowdown, but to a significant degree the effect of the implied bus lock.
To test that theory, here are the timings:
1) spinlocks present: 6593.045
2) spinlock acquisition replaced by *full* memory barriers, which on x86 is a lock; addl $0,0(%%rsp): 5159.306
3) spinlocks replaced by read/write barriers as appropriate: 4687.610
By my understanding of shm_mq.c's logic, something like 3) ought to be doable in a correct manner. In normal circumstances, there should only be one end modifying each of the protected variables. Doing so, instead of using full-blown atomics with locked instructions, is very likely to yield considerably better performance.
The top-level profile after 3 is:
leader:
+ 25.89% postgres postgres [.] shm_mq_receive
+ 24.78% postgres postgres [.] SetLatch
+ 14.63% postgres postgres [.] AllocSetReset
+ 7.31% postgres postgres [.] ExecGather
worker:
+ 14.02% postgres postgres [.] ExecInterpExpr
+ 11.63% postgres postgres [.] shm_mq_send_bytes
+ 11.25% postgres postgres [.] heap_getnext
+ 6.78% postgres postgres [.] SetLatch
+ 6.38% postgres postgres [.] slot_deform_tuple
still a lot of cycles in the queue code, but proportionally less.
4) Modulo computations in shm_mq are expensive:
       │      shm_mq_send_bytes():
       │      Size offset = mq->mq_bytes_written % (uint64) ringsize;
  0.12 │1 70:   xor    %edx,%edx
       │      Size sendnow = Min(available, ringsize - offset);
       │        mov    %r12,%rsi
       │      Size offset = mq->mq_bytes_written % (uint64) ringsize;
 43.75 │        div    %r12
       │      memcpy(&mq->mq_ring[mq->mq_ring_offset + offset],
  7.23 │        movzbl 0x31(%r15),%eax
       │      shm_mq_receive_bytes():
       │      used = written - mq->mq_bytes_read;
  1.17 │        sub    %rax,%rcx
       │      offset = mq->mq_bytes_read % (uint64) ringsize;
 18.49 │        div    %rbp
       │        mov    %rdx,%rdi
That we end up with a full-blown div makes sense - the compiler can't know anything about ringsize, therefore it can't optimize to a mask. I think we should force the size of the ringbuffer to be a power of two, and use a mask instead of a size for the buffer.
5) There's a *lot* of latch interactions. The biggest issue actually is the memory barrier implied by SetLatch(); waiting for the latch barely shows up.
from 4) above:
leader:
+ 25.89% postgres postgres [.] shm_mq_receive
+ 24.78% postgres postgres [.] SetLatch
+ 14.63% postgres postgres [.] AllocSetReset
+ 7.31% postgres postgres [.] ExecGather
       │  0000000000781b10 <SetLatch>:
       │  SetLatch():
       │      /*
       │       * The memory barrier has to be placed here to ensure that any flag
       │       * variables possibly changed by this process have been flushed to main
       │       * memory, before we check/set is_set.
       │       */
       │      pg_memory_barrier();
 77.43 │        lock   addl $0x0,(%rsp)
       │
       │      /* Quick exit if already set */
       │      if (latch->is_set)
  0.12 │        mov    (%rdi),%eax
Commenting out the memory barrier - which is NOT CORRECT - improves timing:
before: 4675.626
after: 4125.587
The correct fix obviously is not to break latch correctness. I think the big problem here is that we perform a SetLatch() for every read from the queue.
I think we should
a) introduce a batch variant for receiving like:

extern shm_mq_result shm_mq_receivev(shm_mq_handle *mqh, shm_mq_iovec *iov,
                                     int *iovcnt, bool nowait)

which then only does the SetLatch() at the end, and maybe if it wrapped around.
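For illustration, a caller in the gather code could then look roughly like this. This is a sketch, not working code: shm_mq_receivev() doesn't exist yet, BATCH_SIZE is an invented constant, and process_message() is a stand-in for handing each tuple to Gather:

```c
/* hypothetical batched receive: drain whatever the queue currently
 * holds in one call, with a single SetLatch() for the whole batch */
shm_mq_iovec  iov[BATCH_SIZE];
int           iovcnt = BATCH_SIZE;
shm_mq_result res;

res = shm_mq_receivev(reader->queue, iov, &iovcnt, true /* nowait */);
if (res == SHM_MQ_SUCCESS)
{
    for (int i = 0; i < iovcnt; i++)
        process_message(iov[i].data, iov[i].len);
}
```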
b) Use shm_mq_sendv in tqueue.c by batching up insertions into the queue whenever it's not empty when a tuple is ready.
I've not benchmarked this, but I'm pretty certain that the benefit isn't just going to be the reduced cost of SetLatch() calls, but also increased performance due to fewer context switches.
6) I've observed, using strace, debug output with timings, and top with a short interval, that quite often only one backend has sufficient work, while other backends are largely idle.
I think the problem here is that the "anti round robin" provisions from bc7fcab5e36b9597857, while much better than the previous state, have swung a bit too far in the other direction. Especially if we were to introduce batching as I suggest in 5), but even without that, this leads to back-and-forth on half-empty queues between the gatherstate->nextreader backend and the leader.
I'm not 100% certain what the right fix here is.
One fairly drastic solution would be to move away from a single-producer-single-consumer, per-worker queue to a global queue. That'd avoid fairness issues between the individual workers, at the price of potentially added contention. One disadvantage is that such a combined queue approach is not easily applicable to gather merge.
One less drastic approach would be to try to drain a queue fully in one batch, and then move to the next queue. That'd avoid triggering a small wakeup for each individual tuple, as one currently would get without the 'stickyness'.
It might also be a good idea to use a more directed form of wakeup, e.g. using a per-worker latch + a wait event set, to avoid iterating over all workers.
Unfortunately the patch's "local worker queue" concept seems, to me,
like it's not quite addressing the structural issues, instead opting to
address them by adding another layer of queuing. I suspect that if we'd
go for the above solutions there'd be only very small benefit in
implementing such per-worker local queues.
Greetings,
Andres Freund
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers