Home > mailing lists

Re: Reducing ClogControlLock contention - Mailing list pgsql-hackers

From	Andres Freund
Subject	Re: Reducing ClogControlLock contention
Date	August 22, 2015 14:15:03
Msg-id	20150822141448.GH8552@awork2.anarazel.de Whole thread Raw
In response to	Reducing ClogControlLock contention (Simon Riggs <simon@2ndQuadrant.com>)
Responses	Re: Reducing ClogControlLock contention Re: Reducing ClogControlLock contention
List	pgsql-hackers

Tree view

Hi,

Amit pinged me about my opinion of this patch. I don't really have
time/energy for an in-depth review right now, but since I'd looked
enough to see some troublesome points, I thought I'd write those.

On 2015-06-30 08:02:25 +0100, Simon Riggs wrote:
> Proposal for improving this is to acquire the ClogControlLock in Shared
> mode, if possible.

I find that rather scary. That requires for all read and write paths in
clog.c and slru.c to only ever read memory locations once. Otherwise
those reads may not get the same results. That'd mean we'd need to add
indirections via volatile to all reads/writes. It also requires that we
never read in 4 byte quantities.

> This is safe because people checking visibility of an xid must always run
> TransactionIdIsInProgress() first to avoid race conditions, which will
> always return true for the transaction we are currently committing. As a
> result, we never get concurrent access to the same bits in clog, which
> would require a barrier.

I don't think that's really sufficient. There's a bunch of callers do
lookups without such checks, e.g. in heapam.c. It might be safe due to
other circumstances, but at the very least this is a significant but
sublte API revision.

> Two concurrent writers might access the same word concurrently, so we
> protect against that with a new CommitLock. We could partition that by
> pageno also, if needed.

To me it seems better to make this more integrated with slru.c. Change
the locking so that the control lock (relevant for page mappings et al)
is different from the locks for reading/writing data.

>       * If we're doing an async commit (ie, lsn is valid), then we must wait
>       * for any active write on the page slot to complete.  Otherwise our
>       * update could reach disk in that write, which will not do since we
> @@ -273,7 +290,9 @@ TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
>       * write-busy, since we don't care if the update reaches disk sooner than
>       * we think.
>       */
> -    slotno = SimpleLruReadPage(ClogCtl, pageno, XLogRecPtrIsInvalid(lsn), xid);
> +    if (!InRecovery)
> +        LWLockAcquire(CommitLock, LW_EXCLUSIVE);
> +    slotno = SimpleLruReadPage_ReadOnly(ClogCtl, pageno, XLogRecPtrIsInvalid(lsn), xid);
>  
>      /*
>       * Set the main transaction id, if any.
> @@ -312,6 +331,9 @@ TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
>      ClogCtl->shared->page_dirty[slotno] = true;
>  
>      LWLockRelease(CLogControlLock);
> +
> +    if (!InRecovery)
> +        LWLockRelease(CommitLock);
>  }

TransactionIdSetPageStatus() calls TransactionIdSetStatusBit(), which
writes an 8 byte variable (the lsn). That's not safe.

Maybe write_ok = XLogRecPtrIsInvalid(lsn) is trying to address that? If
so, I don't see how. A page is returned with only the shared lock held
if it's in VALID state before. Even if that were changed, this'd be a
mightily subtle thing to do without a very fat comment.

> @@ -447,7 +447,8 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
>   * It is unspecified whether the lock will be shared or exclusive.
>   */
>  int
> -SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid)
> +SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, bool write_ok,
> +                           TransactionId xid)
>  {
>      SlruShared    shared = ctl->shared;
>      int            slotno;
> @@ -460,7 +461,9 @@ SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid)
>      {
>          if (shared->page_number[slotno] == pageno &&
>              shared->page_status[slotno] != SLRU_PAGE_EMPTY &&
> -            shared->page_status[slotno] != SLRU_PAGE_READ_IN_PROGRESS)
> +            shared->page_status[slotno] != SLRU_PAGE_READ_IN_PROGRESS &&
> +            (write_ok ||
> +             shared->page_status[slotno] != SLRU_PAGE_WRITE_IN_PROGRESS))
>          {
>              /* See comments for SlruRecentlyUsed macro */
>              SlruRecentlyUsed(shared, slotno);
> @@ -472,7 +475,7 @@ SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid)
>      LWLockRelease(shared->ControlLock);
>      LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
>  
> -    return SimpleLruReadPage(ctl, pageno, true, xid);
> +    return SimpleLruReadPage(ctl, pageno, write_ok, xid);
>  }

This function's name would definitely need to be changed, and it'll need
to be documented when/how it's safe to use write_ok = true. Same with
slru.c's header.

A patch like this will need far more changes. Every read and write from
a page has to be guaranteed to only be done once, otherwise you can get
'effectively torn' values.

That is, you can't just do
static void
TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn, int slotno)
...char       *byteptr;char        byteval;
.../* note this assumes exclusive access to the clog page */byteval = *byteptr;byteval &= ~(((1 << CLOG_BITS_PER_XACT)
-1) << bshift);byteval |= (status << bshift);*byteptr = byteval;

...

the compiler is entirely free to "optimize" away the byteval variable
and do all these masking operations on memory! It can intermittenly
write temporary values to byteval, because e.g. the register pressure is
too high.

To actually rely on single-copy-atomicity you have to enforce that these
reads/writes can only happen. Leavout out some possible macro
indirection that'd have to look like       byteval = (volatile char *) byteptr;       ...       *(volatile char *)
byteptr= byteval;

some for TransactionIdGetStatus(), without the write side obviously.

and stuff likeif (!XLogRecPtrIsInvalid(lsn)){    int            lsnindex = GetLSNIndex(slotno, xid);
    if (ClogCtl->shared->group_lsn[lsnindex] < lsn)        ClogCtl->shared->group_lsn[lsnindex] = lsn;}
just aren't possible at all unless we drop a bunch of platforms.

In essence I think this patch allows us to roughly benchmark whether and
how benefitial fixing the contention around clog is, but it seems pretty
far from something that can actually be applied.

Greetings,

Andres Freund

pgsql-hackers by date:

From: Fabien COELHO
Date: 22 August 2015, 06:39:54
Subject: Re: PATCH: numeric timestamp in log_line_prefix

From: Heikki Linnakangas
Date: 22 August 2015, 14:24:59
Subject: Re: New functions

Re: Reducing ClogControlLock contention - Mailing list pgsql-hackers

Previous

Next