Thread: Database Stalls
Hi,
We've started to observe instances of one of our databases stalling for a few seconds.
We see a spike in wal write locks then nothing for a few seconds. After which we have spike latency as processes waiting to get to the db can do so.
There is nothing in the postgres logs that give us any clues to what could be happening, no locks, unusually high/long running transactions, just a pause and resume.
Could anyone give me any advice as to what to look for when it comes to checking the underlying disk that the db is on?
Thanks,
Gurmokh
On Mon, Jan 30, 2023 at 05:47:49PM +0000, Mok wrote: > Hi, > > We've started to observe instances of one of our databases stalling for a > few seconds. > > We see a spike in wal write locks then nothing for a few seconds. After > which we have spike latency as processes waiting to get to the db can do > so. > > There is nothing in the postgres logs that give us any clues to what could > be happening, no locks, unusually high/long running transactions, just a > pause and resume. > > Could anyone give me any advice as to what to look for when it comes to > checking the underlying disk that the db is on? What version postgres? What settings have non-default values ? What OS/version? What environment/hardware? VM/image/provider/... Have you enabled logging for vacuum/checkpoints/locks ? https://wiki.postgresql.org/wiki/Slow_Query_Questions
On Mon, Jan 30, 2023 at 2:51 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
On Mon, Jan 30, 2023 at 05:47:49PM +0000, Mok wrote:
> Hi,
>
> We've started to observe instances of one of our databases stalling for a
> few seconds.
>
> We see a spike in wal write locks then nothing for a few seconds. After
> which we have spike latency as processes waiting to get to the db can do
> so.
>
> There is nothing in the postgres logs that give us any clues to what could
> be happening, no locks, unusually high/long running transactions, just a
> pause and resume.
>
> Could anyone give me any advice as to what to look for when it comes to
> checking the underlying disk that the db is on?
What version postgres? What settings have non-default values ?
What OS/version? What environment/hardware? VM/image/provider/...
Have you enabled logging for vacuum/checkpoints/locks ?
https://wiki.postgresql.org/wiki/Slow_Query_Questions
José Arthur Benetasso Villanova
Hi Burmokh,
Please take a look at this article copied below and ping me for further guidance. Thanks!
How expensive SQLs can impact PostgreSQL Performance? - https://minervadb.xyz/how-expensive-sqls-can-impact-postgresql-performance/
—
Best
Shiv
On 30-Jan-2023, at 11:17 PM, Mok <gurmokh@gmail.com> wrote:Hi,We've started to observe instances of one of our databases stalling for a few seconds.We see a spike in wal write locks then nothing for a few seconds. After which we have spike latency as processes waiting to get to the db can do so.There is nothing in the postgres logs that give us any clues to what could be happening, no locks, unusually high/long running transactions, just a pause and resume.Could anyone give me any advice as to what to look for when it comes to checking the underlying disk that the db is on?Thanks,Gurmokh
Hi,
Unfortunately there is no pg_stat_activity data available as we are unaware of the issue until it has already happened.
The version we are on is 12.11.
I don't think it is due to locks as there are none in the logs. Vacuums are logged also and none occur before or after this event. Checkpoint timeout is set to 1 hour and these events do not coincide with checkpoints.
Gurmokh
On Mon, 30 Jan 2023 at 18:47, Shiv Iyer <shiv@minervadb.com> wrote:
Hi Burmokh,Please take a look at this article copied below and ping me for further guidance. Thanks!How expensive SQLs can impact PostgreSQL Performance? - https://minervadb.xyz/how-expensive-sqls-can-impact-postgresql-performance/
—BestShivOn 30-Jan-2023, at 11:17 PM, Mok <gurmokh@gmail.com> wrote:Hi,We've started to observe instances of one of our databases stalling for a few seconds.We see a spike in wal write locks then nothing for a few seconds. After which we have spike latency as processes waiting to get to the db can do so.There is nothing in the postgres logs that give us any clues to what could be happening, no locks, unusually high/long running transactions, just a pause and resume.Could anyone give me any advice as to what to look for when it comes to checking the underlying disk that the db is on?Thanks,Gurmokh
On Mon, Jan 30, 2023 at 4:32 PM Mok <gurmokh@gmail.com> wrote:
Hi,Unfortunately there is no pg_stat_activity data available as we are unaware of the issue until it has already happened.The version we are on is 12.11.I don't think it is due to locks as there are none in the logs. Vacuums are logged also and none occur before or after this event. Checkpoint timeout is set to 1 hour and these events do not coincide with checkpoints.Gurmokh
Have you eliminated network issues? I have seen what looks like a database stalling to end up actually being the network packets taking a side trip to halfway around the world for a while. Or DNS lookups suddenly taking a really long time.
The next most likely thing is disk i/o. Do you have huge corresponding disk i/o spikes or does it drop completely to zero (which is also bad - especially if you are on a SAN and you can't get any packets out on that network). You'll have to look at your disks via OS tools to see.
Do you have any hardware faults? Errors on a hardware bus? Overheating? I used to have a system that would freeze up entirely due to a problem with a serial port that we had a console attached to - it was sending a low level interrupt. Sometimes it would recover mysteriously if someone hit the carriage return a couple times. Ie, is it _really_ the database that is locking up, or is it your hardware?
Consider creating a pg_stat_activity history table. This would allow you to look back at the time of incident and verify if any unusual activity was occurring in the database. Something like:
CREATE TABLE pg_stat_activity_hist AS SELECT now() AS sample_time, a.* FROM pg_stat_activity a WITH NO data;
Then with a cron job or a pg job scheduler insert the pg_stat_activity history at some desired interval (e.g 30s, 1m or 5m):
INSERT INTO pg_stat_activity_hist
SELECT
now(),
a.*
FROM
pg_stat_activity a
WHERE
state IN ('active', 'idle in transaction’);
Then regularly purge any sample_times older than some desired interval (1 day, 1 week, 1 month).
Not a perfect solution because the problem (if a db problem) could occur between your pg_stat_activity samples. We keep this kind of history and it is very helpful when trying to find a post-event root cause.
Craig
On Jan 30, 2023 at 10:47:49 AM, Mok <gurmokh@gmail.com> wrote:
Hi,We've started to observe instances of one of our databases stalling for a few seconds.We see a spike in wal write locks then nothing for a few seconds. After which we have spike latency as processes waiting to get to the db can do so.There is nothing in the postgres logs that give us any clues to what could be happening, no locks, unusually high/long running transactions, just a pause and resume.Could anyone give me any advice as to what to look for when it comes to checking the underlying disk that the db is on?Thanks,Gurmokh
This electronic communication and the information and any files transmitted with it, or attached to it, are confidential and are intended solely for the use of the individual or entity to whom it is addressed and may contain information that is confidential, legally privileged, protected by privacy laws, or otherwise restricted from disclosure to anyone else. If you are not the intended recipient or the person responsible for delivering the e-mail to the intended recipient, you are hereby notified that any use, copying, distributing, dissemination, forwarding, printing, or copying of this e-mail is strictly prohibited. If you received this e-mail in error, please return the e-mail to the sender, delete it from your computer, and destroy any printed copy of it.