Re: INDEX suggestion needed - Mailing list pgsql-general
From | Thomas Beutin |
---|---|
Subject | Re: INDEX suggestion needed |
Date | |
Msg-id | 20021213164138.C17113@laokoon.bug.net Whole thread Raw |
In response to | Re: INDEX suggestion needed (Manfred Koizar <mkoi-pg@aon.at>) |
Responses |
Re: INDEX suggestion needed
Re: INDEX suggestion needed |
List | pgsql-general |
Hi, at first thanks to all people for help! On Thu, Dec 12, 2002 at 10:00:48PM +0100, Manfred Koizar wrote: > On Thu, 12 Dec 2002 20:13:24 +0100, Thomas Beutin > <tyrone@laokoon.IN-Berlin.DE> wrote: [...] > >tb=# EXPLAIN ANALYZE SELECT COUNT(DISTINCT a_id) FROM stat_pages WHERE m_id = 35::smallint AND (visit >= '2002-06-01'AND visit <= '2002-12-11'); > > This selects (almost) all rows. An index cannot help. > > >i got the following index/sequence scans by date ranges: > >(visit >= '2002-06-01' AND visit <= '2002-06-30') index scan > > 0 rows satisfy this condition, index scan is ok > > >(visit >= '2002-06-01' AND visit <= '2002-07-31') index scan > > Ca. 10000 rows, 3% of the whole table, index scan ok Ahh, now i get a clear look into index usability ;-) > >(visit >= '2002-06-01' AND visit <= '2002-08-31') sequence scan > > 32%, sequence scan is expected to be faster, unless tuples are almost > perfectly ordered by visit or most of the table (~ 8000 pages) fits > into the cache. How much physical memory is installed? What are your > shared_buffers and effective_cache_size settings? You might want to > experiment with > SET enable_seqscan = off; Ok, this queries are running on my developer workstation (notebook): 550 MHz mobile celeron, 128 MB RAM, 512MB swap space, IDE running a apache/php, postgres 7.2.3, X11/gnome and mozilla 1.2.1 This are my entries in postgresql.conf: #max_connections = 32 #shared_buffers = 64 # 2*max_connections, min 16 #effective_cache_size = 1000 # default in 8k pages so i'm guessing i run the default values. Playing around with enable_seqscan = on/off shows the planner's right choices. Actually i cannot connect to the production server for stats, but the hardware is: 2x iPIII 850MHz, 2GB RAM, 1GB swap, SCSI RAID5 for database running a apache/php and postgres 7.2.? Should be more reasonable... [...] > >(visit >= '2002-10-01' AND visit <= '2002-10-31') index scan, but long (>5sec) > > Ca. 3%; interesting that it takes so long; I'd like to see EXPLAIN > ANALYZE output for enable_seqscan on and off. tb=# set enable_seqscan=on; SET VARIABLE itb=# EXPLAIN ANALYZE SELECT COUNT(DISTINCT a_id) FROM stat_pages WHERE m_id = 35::smallint AND (visit >= '2002-10-01' ANDvisit <= '2002-10-31'); NOTICE: QUERY PLAN: Aggregate (cost=8788.75..8788.75 rows=1 width=34) (actual time=4663.69..4663.70 rows=1 loops=1) -> Index Scan using tb5 on stat_pages (cost=0.00..8712.60 rows=30459 width=34) (actual time=0.41..468.35 rows=29937 loops=1) Total runtime: 4663.99 msec EXPLAIN tb=# set enable_seqscan=off; SET VARIABLE tb=# EXPLAIN ANALYZE SELECT COUNT(DISTINCT a_id) FROM stat_pages WHERE m_id = 35::smallint AND (visit >= '2002-10-01' ANDvisit <= '2002-10-31'); NOTICE: QUERY PLAN: Aggregate (cost=8788.75..8788.75 rows=1 width=34) (actual time=4788.35..4788.35 rows=1 loops=1) -> Index Scan using tb5 on stat_pages (cost=0.00..8712.60 rows=30459 width=34) (actual time=0.41..439.44 rows=29937 loops=1) Total runtime: 4788.65 msec EXPLAIN There is no difference in cost. > >And: The date range in my table is from 2002-07-10 11:36:53+02 up to > >2002-10-29 23:31:47+01. > > Yes, this is approximately reflected by the histogram bounds. > > > attname | null_frac | avg_wi | n_distinct | correlation > >---------+-----------+--------+------------+------------- > > visit | 0 | 8 | -0.543682 | -0.972118 > > The negative correlation looks strange. How did you insert your data? It is a dump from the production system, and the production system gets the data once a day from webserver logs line by line. > > m_id | 0 | 2 | 1 | 1 > ^^^ > Only one distinct value in m_id? This explains why your m_id index is > never used. Yes, because i copied only a part of the data, it's still a lot for my developer system ;-) I will playing around on the production system next week, may be i give combined indicees (m_id, visit) with enable_seqscan=on/off a try. > > attname | most_common_vals > >---------+-------------------------- > > visit | "2002-08-21 10:29:10+02", ... > > m_id | 35 > > > attname | histogram_bounds > >---------+-------------------------- > > visit | "2002-07-25 16:37:12+02" > > "2002-08-15 12:36:18+02" > > "2002-08-23 12:36:15+02" > > "2002-08-29 17:30:54+02" > > "2002-09-05 12:54:31+02" > > "2002-09-10 18:03:54+02" > > "2002-09-16 15:44:56+02" > > "2002-09-20 14:34:40+02" > > "2002-09-24 13:59:29+02" > > "2002-09-29 09:09:31+02" > > "2002-10-29 23:25:13+01" > > m_id | > > > > >??? Is this output ok? > > Almost. I forgot to ask for most_common_freqs (cut'n'paste error). Here is the result: tb=# SELECT attname, null_frac, avg_width, n_distinct, most_common_vals, most_common_freqs, histogram_bounds, correlationFROM pg_stats WHERE tablename = 'stat_pages' AND attname IN ('m_id', 'visit'); attname | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs | histogram_bounds | correlation ---------+-----------+-----------+------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------- visit | 0 | 8 | -0.465972 | {"2002-08-21 10:31:18+02","2002-08-28 15:28:04+02","2002-09-02 08:50:08+02","2002-09-0213:48:49+02","2002-09-04 13:00:03+02","2002-09-06 18:55:19+02","2002-09-12 15:24:14+02","2002-09-1313:12:39+02","2002-09-18 12:55:07+02","2002-09-18 15:01:52+02"} | {0.000666667,0.000666667,0.000666667,0.000666667,0.000666667,0.000666667,0.000666667,0.000666667,0.000666667,0.000666667} |{"2002-07-25 07:04:05+02","2002-08-15 07:52:47+02","2002-08-22 11:00:35+02","2002-08-29 11:59:47+02","2002-09-05 13:56:08+02","2002-09-1108:08:52+02","2002-09-16 10:48:37+02","2002-09-20 11:50:46+02","2002-09-23 22:25:32+02","2002-09-2713:01:03+02","2002-10-29 23:31:18+01"} | -0.964541 m_id | 0 | 2 | 1 | {35} | {1} | | 1 (2 rows) > But I don't expect any value of visit to occur much more than twice, > so these values should be irrelevant to our estimations. The probability isn't high, but it could happen to see the same value multiple times. These are web server log data in a one second grid. Regards, -tb -- Thomas Beutin tb@laokoon.IN-Berlin.DE Beam me up, Scotty. There is no intelligent live down in Redmond.
pgsql-general by date: