Thread: 2 machines, same database, same query, 10 times slower?
Hi,

I am running PostgreSQL 8.1 on CentOS 5. I have two machines with the same
hardware and the same database layout. They hold different data, and the
same query runs 10 times as slow on one machine as on the other. I have
run ANALYZE on both machines; here are the query plans for both. Can
someone explain this to me? I have trouble reading EXPLAIN output...

Query:

explain select company.name as cname, call_flags,
       bill_duration as call_duration,
       date_part('epoch', start_time) as start_time,
       src_name, src_type, src_num, src_id,
       dial_name, dial_type, dial_num, dial_id,
       ans_name, ans_type, ans_num, ans_id, sessionid
  from cdr, company, phoneline, contact
 where (src_id = contact.id or dial_id = contact.id or ans_id = contact.id)
   and contact.id = '2'
   and phoneline.function = 'contact'
   and phoneline.lookupid = contact.id
   and phoneline.status != 'deleted'
   and (src_company = company.id or dial_company = company.id)
   and company.id > 2
 order by start_time DESC limit 10;

This is the query plan on machine #1 (the query takes 2 seconds):

 Limit  (cost=106128.33..106128.36 rows=10 width=160)
   ->  Sort  (cost=106128.33..106166.98 rows=15458 width=160)
         Sort Key: date_part('epoch'::text, cdr.start_time)
         ->  Nested Loop  (cost=49.38..104275.65 rows=15458 width=160)
               ->  Nested Loop  (cost=2.10..103880.57 rows=7729 width=164)
                     Join Filter: (("outer".src_company = "inner".id) OR ("outer".dial_company = "inner".id))
                     ->  Nested Loop  (cost=0.00..103054.09 rows=6595 width=163)
                           Join Filter: (("inner".src_id = "outer".id) OR ("inner".dial_id = "outer".id) OR ("inner".ans_id = "outer".id))
                           ->  Index Scan using contact_pkey on contact  (cost=0.00..5.94 rows=1 width=4)
                                 Index Cond: (id = 2)
                           ->  Seq Scan on cdr  (cost=0.00..77039.87 rows=1486187 width=159)
                     ->  Materialize  (cost=2.10..2.16 rows=5 width=13)
                           ->  Seq Scan on company  (cost=0.00..2.10 rows=5 width=13)
                                 Filter: (id > 2)
               ->  Materialize  (cost=47.28..47.30 rows=2 width=4)
                     ->  Seq Scan on phoneline  (cost=0.00..47.28 rows=2 width=4)
                           Filter: ((("function")::text = 'contact'::text) AND ((status)::text <> 'deleted'::text) AND (lookupid = 2))
(17 rows)

This is the query plan on machine #2 (the query takes 38 seconds):

 Limit  (cost=424555.76..424555.79 rows=10 width=170)
   ->  Sort  (cost=424555.76..424574.34 rows=7432 width=170)
         Sort Key: date_part('epoch'::text, cdr.start_time)
         ->  Nested Loop  (cost=422353.60..424077.90 rows=7432 width=170)
               ->  Nested Loop  (cost=422064.10..423621.19 rows=3716 width=174)
                     Join Filter: (("inner".src_company = "outer".id) OR ("inner".dial_company = "outer".id))
                     ->  Bitmap Heap Scan on company  (cost=2.09..49.23 rows=26 width=21)
                           Recheck Cond: (id > 2)
                           ->  Bitmap Index Scan on company_pkey  (cost=0.00..2.09 rows=26 width=0)
                                 Index Cond: (id > 2)
                     ->  Materialize  (cost=422062.01..422085.24 rows=2323 width=165)
                           ->  Nested Loop  (cost=0.00..422059.69 rows=2323 width=165)
                                 Join Filter: (("inner".src_id = "outer".id) OR ("inner".dial_id = "outer".id) OR ("inner".ans_id = "outer".id))
                                 ->  Index Scan using contact_pkey on contact  (cost=0.00..6.01 rows=1 width=4)
                                       Index Cond: (id = 2)
                                 ->  Seq Scan on cdr  (cost=0.00..408379.70 rows=781370 width=161)
               ->  Materialize  (cost=289.50..289.52 rows=2 width=4)
                     ->  Seq Scan on phoneline  (cost=0.00..289.50 rows=2 width=4)
                           Filter: ((("function")::text = 'contact'::text) AND ((status)::text <> 'deleted'::text) AND (lookupid = 2))
(19 rows)

Thanks,
Antonio
Hi,

On 8 May 2012, 16:48, Antonio Goméz Soto wrote:
> I am running PostgreSQL 8.1 on CentOS 5. I have two machines with the
> same hardware and the same database layout. They hold different data,
> and the same query runs 10 times as slow on one machine as on the other.

First of all, to analyze runtime differences it's important to provide
EXPLAIN ANALYZE output, not just EXPLAIN. Re-run the queries and use
explain.depesz.com to post the output.

Second, what do you mean by 'different data'? If the amounts of data
differ, it may be perfectly expected that the query runs much slower on
the machine with more data. For example, the plans contain this:

A: Seq Scan on cdr  (cost=0.00..77039.87 rows=1486187 width=159)
B: Seq Scan on cdr  (cost=0.00..408379.70 rows=781370 width=161)

That suggests the second database contains about half the rows.

The seq scan nodes reveal another interesting fact: while the expected
row count in the second plan is about 50% of the first, the estimated
cost is about 5x higher. The important thing here is that most of the
cost estimate comes from the number of pages, so I suppose cdr occupies
about 5x the space in the second case, although it's much more 'sparse'.

Run this on both machines to verify:

  SELECT relpages, reltuples FROM pg_class WHERE relname = 'cdr';

That might happen for example by deleting a lot of rows recently (without
running VACUUM FULL afterwards), or by not running autovacuum at all -
which is quite likely, because it was introduced in 8.1 and was off by
default.

BTW if you care about performance, you should upgrade to a more recent
version (preferably 9.x) - 8.1 has been out of support for several years
IIRC, and there have been many improvements since then.

Tomas
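P.S. If I remember correctly, 8.1 already ships pg_relation_size() and
pg_size_pretty() in core, so you can also sanity-check the on-disk size of
the table directly:

  SELECT pg_size_pretty(pg_relation_size('cdr'));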
Hi Tomas,

thanks for responding.

On 08-05-12 17:34, Tomas Vondra wrote:
> First of all, to analyze runtime differences it's important to provide
> EXPLAIN ANALYZE output, not just EXPLAIN. Re-run the queries and use
> explain.depesz.com to post the output.

All right, thanks, I didn't know that. I reran the queries, and they are
posted here:

The slow one: http://explain.depesz.com/s/2Si

The fast one: http://explain.depesz.com/s/c9m3

> That suggests the second database contains about half the rows.

That is true.

> Run this on both machines to verify:
>
>   SELECT relpages, reltuples FROM pg_class WHERE relname = 'cdr';

Slow machine:

 relpages | reltuples
----------+-----------
   400566 |    982321

Fast machine:

 relpages | reltuples
----------+-------------
    62076 | 1.48375e+06

> That might happen for example by deleting a lot of rows recently (without
> running VACUUM FULL afterwards), or by not running autovacuum at all -
> which is quite likely, because it was introduced in 8.1 and was off by
> default.

Autovacuum is running on both machines and does not report errors, but I
did not run a VACUUM FULL. There currently are users on the machine, so I
can try that later tonight.

> BTW if you care about performance, you should upgrade to a more recent
> version (preferably 9.x) - 8.1 has been out of support for several years
> IIRC, and there have been many improvements since then.

I would like to, but I am bound to distribution-supplied software versions.

Thanks a lot for helping,
Antonio
2012/5/8 Antonio Goméz Soto <antonio.gomez.soto@gmail.com>:
> All right, thanks, I didn't know that. I reran the queries, and they are
> posted here:
>
> The slow one: http://explain.depesz.com/s/2Si
>
> The fast one: http://explain.depesz.com/s/c9m3

The cdr table on the "slow machine" probably needs a VACUUM FULL.

Regards

Pavel
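P.S. Something along these lines (untested; note that VACUUM FULL takes an
exclusive lock on the table for the duration, and on 8.1 it tends to bloat
the indexes while moving tuples around, so a REINDEX afterwards is usually
a good idea):

  VACUUM FULL VERBOSE cdr;
  REINDEX TABLE cdr;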
On 8.5.2012 20:32, Antonio Goméz Soto wrote:
>> Run this on both machines to verify:
>>
>>   SELECT relpages, reltuples FROM pg_class WHERE relname = 'cdr';
>
> Slow machine:
>
>  relpages | reltuples
> ----------+-----------
>    400566 |    982321
>
> Fast machine:
>
>  relpages | reltuples
> ----------+-------------
>     62076 | 1.48375e+06

Yup, that's probably the culprit. The slow machine has to read about
400,000 pages while the fast one reads just 62,000. Each page is 8kB, so
that's roughly 3GB vs. 500MB.

T.
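P.S. You can let the server do the arithmetic (this assumes the stock 8kB
block size; adjust if you compiled with a different one):

  SELECT relname, pg_size_pretty(relpages::bigint * 8192) AS approx_size
    FROM pg_class
   WHERE relname = 'cdr';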
Tomas,

running VACUUM FULL on cdr brought the query time down to 4 seconds.

Thanks a lot,
Antonio

On 08-05-12 21:15, Tomas Vondra wrote:
> Yup, that's probably the culprit. The slow machine has to read about
> 400,000 pages while the fast one reads just 62,000. Each page is 8kB, so
> that's roughly 3GB vs. 500MB.
On 9.5.2012 22:50, Antonio Goméz Soto wrote:
> Tomas,
>
> running VACUUM FULL on cdr brought the query time down to 4 seconds.

Fine. I'd recommend checking the other tables too - if this one was
bloated, it's likely there are others.

Anyway, you should find out how this happened and fix it. This kind of
bloat is usually caused either by frequent changes (updates / deletes)
without sufficient vacuuming, or by batch updates hitting many rows at
once (e.g. during maintenance). Or maybe vacuum was not running at all -
I do remember there were frequent issues with the free-space map. See
these:

http://www.varlena.com/GeneralBits/Tidbits/perf.html#maxfsmp
http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server

That should give you a good idea.

Tomas
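P.S. A crude way to spot other candidates is to compare pages to tuples in
pg_class. This is only dead reckoning - the statistics may be stale and
row widths vary between tables - so treat it as a hint, not a verdict:

  SELECT relname, relpages, reltuples,
         round(reltuples / nullif(relpages, 0)) AS tuples_per_page
    FROM pg_class
   WHERE relkind = 'r' AND relpages > 1000
   ORDER BY relpages DESC
   LIMIT 20;

Tables with suspiciously few tuples per page are worth a closer look. Also
compare max_fsm_pages in postgresql.conf with what a database-wide VACUUM
VERBOSE prints at the end - if the free-space map is too small, bloat
creeps in even with vacuuming in place.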