Planner's choice - Mailing list pgsql-general
From | Nigel J. Andrews |
---|---|
Subject | Planner's choice |
Date | |
Msg-id | Pine.LNX.4.21.0211131510460.14887-100000@ponder.fairway2k.co.uk Whole thread Raw |
Responses |
Re: Planner's choice
|
List | pgsql-general |
Ok, I'm not sure why the planner is making the choices it is. Given: From pg_stats: tablename | attname | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs | histogram_bounds | correlation -----------+-------------+-----------+-----------+------------+--------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------- chat_post | poster_id | 0 | 2 | 341 | {2149,1130,731,2595,1879,2473,1842,688,521,1656} | {0.066,0.039,0.0276667,0.0256667,0.0253333,0.023,0.0226667,0.022,0.021,0.021} | {4,252,582,896,1162,1526,1747,1907,2114,2472,2946} | 0.036261 chat_post | time | 0 | 8 | -0.417335 | {"1998-07-08 15:09:00-04","1999-02-26 19:31:00-05","2000-01-2712:07:00-05","2001-05-24 14:30:00-04","2002-01-22 10:04:00-05"} | {0.000666667,0.000666667,0.000666667,0.000666667,0.000666667} | {"1998-03-05 22:47:00-05","1998-08-2911:49:00-04","1999-03-19 19:39:00-05","1999-08-23 09:26:00-04","2000-03-07 13:45:00-05","2000-11-2915:48:00-05","2001-04-11 11:32:00-04","2001-08-31 15:43:00-04","2002-02-01 12:40:00-05","2002-06-1906:32:00-04","2002-11-12 04:44:00-05"} | 1 with the table definition: Table "chat_post" Column | Type | Modifiers -------------+--------------------------+----------- session_id | smallint | not null poster_id | smallint | not null time | timestamp with time zone | not null post_number | smallint | not null fts | txtidx | Indexes: chat_post_text_idx, chat_post_time_idx, chat_post_timeuser_idx, chat_post_user_idx, chat_post_usertime_idx Primary key: chat_post_pkey Triggers: RI_ConstraintTrigger_23712080, RI_ConstraintTrigger_23706474 where chat_post_timeuser_idx is defined on the columns (time,poster_id) and chat_post_usertime_idx is defined on the columns (poster_id,time) Why is the planner not choosing the user_time index? For the first of these I suspect it's the sky high correlation number on the time column but for the other two? avid_chat_archive=> explain analyze select * from chat_post where poster_id = '1600' order by time desc limit 2; NOTICE: QUERY PLAN: Limit (cost=0.00..32.40 rows=2 width=46) (actual time=96204.53..96204.71 rows=2 loops=1) -> Index Scan Backward using chat_post_time_idx on chat_post (cost=0.00..42370.93 rows=2616 width=46) (actual time=96204.49..96204.64rows=3 loops=1) Total runtime: 96205.18 msec EXPLAIN avid_chat_archive=> explain analyze select * from chat_post where poster_id = '1600' order by user, time desc limit 2; NOTICE: QUERY PLAN: Limit (cost=10262.07..10262.07 rows=2 width=46) (actual time=17400.89..17400.95 rows=2 loops=1) -> Sort (cost=10262.07..10262.07 rows=2616 width=46) (actual time=17400.85..17400.88 rows=3 loops=1) -> Index Scan using chat_post_user_idx on chat_post (cost=0.00..10113.60 rows=2616 width=46) (actual time=99.53..16327.97rows=3372 loops=1) Total runtime: 17450.35 msec EXPLAIN avid_chat_archive=> explain analyze select * from chat_post where poster_id = '1600' order by time desc, user limit 2; NOTICE: QUERY PLAN: Limit (cost=10262.07..10262.07 rows=2 width=46) (actual time=1109.68..1109.73 rows=2 loops=1) -> Sort (cost=10262.07..10262.07 rows=2616 width=46) (actual time=1109.65..1109.67 rows=3 loops=1) -> Index Scan using chat_post_user_idx on chat_post (cost=0.00..10113.60 rows=2616 width=46) (actual time=1.47..598.56rows=3372 loops=1) Total runtime: 1166.65 msec (don't forget this is a bogus time due to the caching) EXPLAIN This is part of a larger query/issue. By redoing a query sprinkling 'limit's throughout I managed to make a >90s query complete in 1s (with the loss of the total number of results that would be returned). However, I then noticed that was a feature of the particular poster_id I'd been testing with. Picking a far less frequent poster_id pushed the time up to >300s. The reason for this was the planner's choice of chat_post_time_idx which for low numbers of occurances of poster_id was effectively kicking off a seqscan through an indexscan. My plan now is to maintain my own set of poster_id stats and use one of several query variants depending on what they say but this requires at least some understanding of the choices made by the planner. BTW, the largest number of occurances of a single poster_id is still only 6% of the entire table. So changing the stats gathering shouldn't make any difference? There's also >2000 distinct values of poster_id, not the 341 estimated by the stats, but again this shouldn't matter? -- Nigel J. Andrews
pgsql-general by date: