Thread: How to analyze a slowdown in 9.3.5?
I'm running 9.3.5 on a virtual machine with 5 cores and 24 GB of memory. Disk is on a SAN.

I have a task that runs weekly that processes possibly as many as 120 months' worth of data, one month at a time. Since moving to 9.3.5 (from 8.2!!) the average time for a month has been 3 minutes or less. However, when this job ran this Tuesday, it ran fine for a number of months, but then started slowing down dramatically: 300 minutes for one month and then 167 minutes for the next. I stopped and restarted postgresql; the next block also ran really slowly (157 minutes). I then rebooted the server and the remaining blocks ran at the usual fast speed again, so restarting postgresql didn't fix the problem but rebooting the server did.

Looking at the logs, I see queries with a function call that would normally take no more than 100-200 milliseconds, usually far less, that were taking 100 seconds or longer. This function gets called thousands of times for each month, so that appears to be one source of the slowdown.

I don't suspect a memory leak in the calling program (in PHP), because since moving to this server in December this weekly task has run several times over the same range of months, making pretty much the same function calls each time. I also ran the entire range several times during testing.

One change made to the server since the previous week's run was that I moved up to the latest CentOS kernel (Linux version 3.10.0-123.13.2.el7.x86_64). As far as I can tell, the other virtual servers weren't being slowed down, so I don't suspect problems with the virtual server or the SAN.

If this happens again, what sorts of settings in postgresql.conf or other tools should I be using to try to track down what's causing this?
--
Mike Nolan
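(For reference, the logging settings usually turned on for this kind of investigation; a minimal postgresql.conf sketch with illustrative values, not taken from this setup:)

# auto_explain is a contrib module shipped with PostgreSQL; preloading it needs a restart
shared_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = '5s'      # log the plan of any statement slower than 5 seconds
auto_explain.log_nested_statements = on   # include statements executed inside plpgsql functions
log_lock_waits = on                       # log whenever a session waits longer than deadlock_timeout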
On 9.1.2015 23:14, Michael Nolan wrote:
> I'm running 9.3.5 on a virtual machine with 5 cores and 24 GB of
> memory. Disk is on a SAN.
>
> I have a task that runs weekly that processes possibly as many as
> 120 months worth of data, one month at a time. Since moving to 9.3.5
> (from 8.2!!) the average time for a month has been 3 minutes or less.

Congrats on migrating to a supported version!

A comparison of the configuration used on 8.2 and on 9.3.5 would be helpful (i.e. how did you update the config for the new version?).

> However, when this job ran this Tuesday, it ran fine for a number of
> months, but then started slowing down dramatically [...] restarting
> postgresql didn't fix the problem but rebooting the server did.

What amounts of data are we talking about? Gigabytes? Tens of gigabytes?

> Looking at the logs, I see queries with a function call that would
> normally take no more than 100-200 milliseconds, usually far less,
> that were taking 100 seconds or longer. [...]

But why are the functions taking so much longer? Are they eating CPU or I/O, or are they generally waiting for something (e.g. locks)?

> One change made to the server since the previous week's run was that
> I moved up to the latest CentOS kernel (Linux version
> 3.10.0-123.13.2.el7.x86_64).

And what was the previous kernel version?

However, if it worked fine after rebooting the server, it may not be a kernel issue (unless it somehow depends on uptime). Is there anything in /var/log/messages?

> If this happens again, what sorts of settings in postgresql.conf or
> other tools should I be using to try to track down what's causing
> this?

Well, we don't know what the function is doing, so it'd be nice to get a basic description first. Is it querying the database? Is it inserting or updating large amounts of data? Or is it something more complex?

Aside from that, some basic system stats would be really helpful, so that we can identify the bottleneck (CPU, I/O, locking, ...). A few lines from these commands should give us a basic idea:

  iostat -x -k 5
  vmstat -w 5

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
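(On the lock question, a minimal sketch of a check to run while the job is crawling, using the pg_stat_activity.waiting boolean that 9.3 still provides:)

-- Backends currently blocked waiting for a lock, and the lock they want
SELECT a.pid, a.usename, a.query_start, l.locktype, l.mode, a.query
FROM pg_stat_activity a
JOIN pg_locks l ON l.pid = a.pid AND NOT l.granted
WHERE a.waiting;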
On 01/09/2015 07:52 PM, Tomas Vondra wrote:
> However, if it worked fine after rebooting the server, it may not be a
> kernel issue (unless it somehow depends on uptime). Is there something
> in /var/log/messages?

At first, I was thinking lots of activity within one transaction was messing up the stats and the planner started getting it wrong. But a reboot wouldn't fix that. Would it? What if the reboot rolled back the db, would the stats make sense again?

I have a process that makes a big temp table (with indexes). After it's built, if I don't run a quick analyze on it the planner never uses the indexes right.

Another thing I can think of is never committing. If it started collecting lots and lots of row versions it could get slower and slower. But, then, you'd see the same thing on 8.2, so that's probably not it. Do you have any "Idle in transaction" connections?

-Andy
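(A minimal sketch of that last check, using the pg_stat_activity.state column available since 9.2:)

-- Sessions holding a transaction open while doing nothing
SELECT pid, usename, xact_start, now() - xact_start AS open_for, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;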
The function is a complicated plpgsql function that makes numerous database queries, all read-only. (Other parts of that program may make changes to the database.)
The first database shutdown and the shutdown/reboot later on were both 'clean' shutdowns, so there shouldn't have been any kind of transaction rollback. --
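(If it helps to quantify where the time inside that plpgsql function goes, per-function call counts and timings can be collected by setting track_functions = 'pl' in postgresql.conf and then reading pg_stat_user_functions; a sketch:)

-- Requires track_functions = 'pl' (or 'all'); times are cumulative milliseconds since the last stats reset
SELECT schemaname, funcname, calls, total_time, self_time
FROM pg_stat_user_functions
ORDER BY total_time DESC
LIMIT 20;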
On Sat, Jan 10, 2015 at 12:55 PM, Andy Colson <andy@squeakycode.net> wrote:
> At first, I was thinking lots of activity within one transaction was
> messing up the stats and the planner started getting it wrong. But a
> reboot wouldn't fix that. Would it? What if the reboot rolled back the
> db, would the stats make sense again?
On Fri, Jan 9, 2015 at 7:52 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> On 9.1.2015 23:14, Michael Nolan wrote:
>> I'm running 9.3.5 on a virtual machine with 5 cores and 24 GB of
>> memory. Disk is on a SAN.
>>
>> I have a task that runs weekly that processes possibly as many as
>> 120 months worth of data, one month at a time. Since moving to 9.3.5
>> (from 8.2!!) the average time for a month has been 3 minutes or less.
> Congrats on migrating to a supported version!
Yeah, it's been a long and annoying 7 years since we updated the server or database version, but I don't make the budget decisions. Going to PGCON was frustrating when nearly all the talks were about features added several versions after the one I was stuck running!
--
Mike Nolan

PS. Sorry about the top-posting in my last note.
Just curious. Have you checked that the tables are being vacuum/analyzed periodically and that the statistics are up to date? Try running the following query to verify:
SELECT n.nspname,
       s.relname,
       c.reltuples::bigint,
       -- n_live_tup,
       n_tup_ins,
       n_tup_upd,
       n_tup_del,
       date_trunc('second', last_vacuum) AS last_vacuum,
       date_trunc('second', last_autovacuum) AS last_autovacuum,
       date_trunc('second', last_analyze) AS last_analyze,
       date_trunc('second', last_autoanalyze) AS last_autoanalyze,
       round(current_setting('autovacuum_vacuum_threshold')::integer
             + current_setting('autovacuum_vacuum_scale_factor')::numeric * c.reltuples) AS av_threshold
  FROM pg_stat_all_tables s
  JOIN pg_class c ON c.oid = s.relid
  JOIN pg_namespace n ON n.oid = c.relnamespace
 WHERE s.relname NOT LIKE 'pg_%'
   AND s.relname NOT LIKE 'sql_%'
 ORDER BY 1, 2;
--
Melvin Davidson
I reserve the right to fantasize. Whether or not you
wish to share my fantasy is entirely up to you.

On Sat, Jan 10, 2015 at 8:54 PM, Melvin Davidson <melvin6925@gmail.com> wrote:
> Just curious. Have you checked that the tables are being vacuum/analyzed periodically and that the statistics are up to date?
A vacuum analyze runs every night and there would not have been many inserts or updates to the tables used by the lookup function since the latest vacuum analyze. I think I may have even done a vacuum analyze on the two largest tables after the first DB shutdown.
--
Mike Nolan
OK, you might also want to look at the current values of shared_buffers,
temp_buffers & work_mem in postgresql.conf.
If they seem correct/appropriate for your total shmmax memory (kernel.shmmax parameter), then if the slowdown occurs again, monitor top and see whether it's really PostgreSQL that is slowing down, or perhaps some other process grabbing CPU time.
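(The current values can also be read straight from pg_settings; a sketch:)

-- Memory-related settings and where each value came from (default, configuration file, ...)
SELECT name, setting, unit, source
FROM pg_settings
WHERE name IN ('shared_buffers', 'temp_buffers', 'work_mem', 'maintenance_work_mem', 'effective_cache_size')
ORDER BY name;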
--
Melvin Davidson
I reserve the right to fantasize. Whether or not you
wish to share my fantasy is entirely up to you.

On Sun, Jan 11, 2015 at 8:07 AM, Michael Nolan <htfoot@gmail.com> wrote:
> A vacuum analyze runs every night and there would not have been many
> inserts or updates to the tables used by the lookup function since the
> latest vacuum analyze.
One thing to check (I ran into this two weeks ago) -- even though vacuums were happening, a query running on a standby machine was preventing the vacuum process from removing the dead rows. You may want to check for bloat, or use 'vacuum verbose' to see if there are many dead rows not being cleaned up.
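(A sketch of both checks; the table name below is just a placeholder:)

-- Reports dead row versions, including any that "cannot be removed yet"
VACUUM VERBOSE some_large_table;

-- Dead tuples per table, largest first
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;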
Joe
For what it's worth, this week's run covered even more months than last week's did, and ran in about 5 1/2 hours, with no slowdowns, under a similar system load. So, it could have been a one-time thing or some combination of factors that will be difficult to reproduce. -- Mike Nolan