Re: WIP: WAL prefetch (another approach) - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: WIP: WAL prefetch (another approach)
Msg-id: 20210422013411.tbcaqqq6c23s2pxy@alap3.anarazel.de
In response to: Re: WIP: WAL prefetch (another approach) (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: WIP: WAL prefetch (another approach)
List: pgsql-hackers
Hi,

On 2021-04-21 21:21:05 -0400, Tom Lane wrote:
> What I'm doing is running the core regression tests with a single
> standby (on the same machine) and wal_consistency_checking = all.

Do you run them over replication, or sequentially by storing data into
an archive? Just curious, because it's so painful to run that scenario
in the replication case, due to the tablespaces conflicting between
primary and standby, unless one disables the tablespace tests.

> The other PPC machine (with no known history of trouble) is the one
> that had the CRC failure I showed earlier. That one does seem to be
> actual bad data in the stored WAL, because the problem was also seen
> by pg_waldump, and trying to restart the standby got the same failure
> again.

It seems like that could also indicate an xlogreader bug that is
reliably hit? Once it gets confused about record lengths or such, I'd
expect CRC failures... If it were actually wrong WAL contents, I don't
think any of the xlogreader / prefetching changes could be
responsible...

Have you tried reproducing it on commits before the recent xlogreader
changes?

commit 1d257577e08d3e598011d6850fd1025858de8c8c
Author: Thomas Munro <tmunro@postgresql.org>
Date:   2021-04-08 23:03:43 +1200

    Optionally prefetch referenced data in recovery.

commit f003d9f8721b3249e4aec8a1946034579d40d42c
Author: Thomas Munro <tmunro@postgresql.org>
Date:   2021-04-08 23:03:34 +1200

    Add circular WAL decoding buffer.

    Discussion: https://postgr.es/m/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=AovOddfHpA@mail.gmail.com

commit 323cbe7c7ddcf18aaf24b7f6d682a45a61d4e31b
Author: Thomas Munro <tmunro@postgresql.org>
Date:   2021-04-08 23:03:23 +1200

    Remove read_page callback from XLogReader.

Trying 323cbe7c7ddcf18aaf24b7f6d682a45a61d4e31b^ is probably the most
interesting bit.

> I've not been able to duplicate the consistency-check failures
> there. But because that machine is a laptop with a much inferior disk
> drive, the speeds are enough different that it's not real surprising
> if it doesn't hit the same problem.
>
> I've also tried to reproduce on 32-bit and 64-bit Intel, without
> success. So if this is real, maybe it's related to being big-endian
> hardware? But it's also quite sensitive to $dunno-what, maybe the
> history of WAL records that have already been replayed.

It might just be disk speed influencing how long the tests take, which
in turn increases the number of checkpoints that happen during the
test, increasing the number of FPIs?

Greetings,

Andres Freund
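To make the record-length point above concrete, here is a simplified
sketch of the per-record CRC validation, modeled on ValidXLogRecord()
in src/backend/access/transam/xlogreader.c. The struct and the
crc32c_update() helper are illustrative stand-ins, not the real
PostgreSQL definitions:

/*
 * Simplified sketch of xlogreader's record CRC check; not the real
 * PostgreSQL code.  WalRecordSketch stands in for XLogRecord, and
 * crc32c_update() for the INIT/COMP/FIN_CRC32C macros.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct
{
    uint32_t    xl_tot_len;     /* total record length, header included */
    /* ... other header fields elided ... */
    uint32_t    xl_crc;         /* CRC of the record, excluding this field */
} WalRecordSketch;

/* Bitwise CRC-32C (Castagnoli) update step, for illustration only. */
static uint32_t
crc32c_update(uint32_t crc, const void *data, size_t len)
{
    const uint8_t *p = data;

    while (len-- > 0)
    {
        crc ^= *p++;
        for (int i = 0; i < 8; i++)
            crc = (crc >> 1) ^ (0x82F63B78 & ((uint32_t) 0 - (crc & 1)));
    }
    return crc;
}

static bool
record_crc_matches(const WalRecordSketch *record)
{
    const char *base = (const char *) record;
    uint32_t    crc = 0xFFFFFFFF;   /* INIT_CRC32C */

    /*
     * The CRC covers xl_tot_len bytes: the payload following the header
     * first, then the header itself up to (but excluding) xl_crc.
     */
    crc = crc32c_update(crc, base + sizeof(WalRecordSketch),
                        record->xl_tot_len - sizeof(WalRecordSketch));
    crc = crc32c_update(crc, base, offsetof(WalRecordSketch, xl_crc));
    crc ^= 0xFFFFFFFF;              /* FIN_CRC32C */

    return crc == record->xl_crc;
}

Because the byte range that gets hashed is derived from xl_tot_len, a
reader that has become confused about a record's length hashes the
wrong byte range and reports a CRC failure even when the WAL on disk is
intact, which is the failure mode suspected above.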