On Distributions In 7.2 (Longish) - Mailing list pgsql-general
From:    Mark Kirkwood
Subject: On Distributions In 7.2 (Longish)
Date:
Msg-id:  01102716141701.01060@spikey.slithery.org
List:    pgsql-general
The current sources (7.2) have introduced distributional factors into the
system statistics. I have been examining the behaviour of these additions,
using the dataset from my "warehouse comparison" as a test bed, as it has
some large-ish tables with controllable data distributions. I think the
results are quite interesting...

Tables
------

Table "dim0"
 Column |           Type           | Modifiers
--------+--------------------------+-----------
 d0key  | integer                  |
 f1     | timestamp with time zone |
 f2     | character varying(20)    |
 f3     | character varying(20)    |
Indexes: dim0_q1 (on f1 - UNIQUE)
Unique keys: dim0_pk
Rows : 3000

Table "fact0"
 Column |  Type   | Modifiers
--------+---------+-----------
 d0key  | integer |
 d1key  | integer |
 d2key  | integer |
 val    | integer |
 filler | text    |
Indexes: fact0_q1 (on d0key ONLY)
Rows : 3000000
Distribution : d0key uniformly distributed
               1000 occurrences for each value of d0key
               - i.e. 0.001 frequency for each value of d0key
               3000 distinct values of d0key

Query
-----

The query to be examined is:

SELECT d0.f3, count(f.val)
FROM dim0 d0, fact0 f
WHERE d0.d0key = f.d0key
AND d0.f1 BETWEEN '1999-12-01' AND '2000-02-29'
GROUP BY d0.f3;

This will join 88 rows from the dim0 table with 88000 from the fact0 table
and group them into 3 "month" buckets.

Case 1 :
--------

Consider using the default distributional sampling settings (10 quantiles):

ALTER TABLE fact0 ALTER d0key SET STATISTICS 10;
ANALYZE fact0;

System Stats
------------

SELECT most_common_vals, most_common_freqs, n_distinct
FROM pg_stats
WHERE tablename = 'fact0' AND attname = 'd0key';

most_common_vals  {"2243","2751","105","250","525","1623","2112","2331","2983","28"}
most_common_freqs {"0.00233333","0.002","0.00166667","0.00166667","0.00166667","0.00166667","0.00166667","0.00166667","0.00166667","0.00133333"}
n_distinct        36511

Note that we are out by an order of magnitude here for the number of distinct
values - it should be 3000 - and the frequencies are a little overestimated -
they should be 0.001.

QUERY PLAN
Aggregate  (cost=29712.88..29749.00 rows=722 width=18)
  ->  Group  (cost=29712.88..29730.94 rows=7225 width=18)
        ->  Sort  (cost=29712.88..29712.88 rows=7225 width=18)
              ->  Nested Loop  (cost=0.00..29249.81 rows=7225 width=18)
                    ->  Index Scan using dim0_q1 on dim0 d0  (cost=0.00..4.43 rows=89 width=10)
                    ->  Index Scan using fact0_q1 on fact0 f  (cost=0.00..326.33 rows=81 width=8)

RESULTS
-------
 f3 | count
----+-------
 01 | 30000
 02 | 28000
 12 | 30000
(3 rows)

ELAPSED : 19s

Clearly the query statistics are underestimating the number of rows for the
nested loop join, by about a factor of 10. The estimated rows from dim0 are
ok.
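As a cross-check (a sketch only, not one of the measurements above), the
EXPLAIN ANALYZE in the current sources will run the query and print the
actual row counts next to the planner's estimates, so the underestimate on
the join can be seen directly:

-- Runs the query and reports estimated vs. actual rows for each plan node.
EXPLAIN ANALYZE
SELECT d0.f3, count(f.val)
FROM dim0 d0, fact0 f
WHERE d0.d0key = f.d0key
AND d0.f1 BETWEEN '1999-12-01' AND '2000-02-29'
GROUP BY d0.f3;

Here the Nested Loop line should show the estimated 7225 rows against the
roughly 88000 rows actually joined.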
Case 2 :
--------

Let's try 100 quantiles:

ALTER TABLE fact0 ALTER d0key SET STATISTICS 100;
ANALYZE fact0;

System Stats
------------

most_common_vals  {"328","1242","524","1515","2058","2168",( 94 more values)...
most_common_freqs {"0.0007","0.0007","0.000666667","0.000666667","0.000666667","0.000666667","0.000666667","0.000666667","0.000633333",....
n_distinct        3027

Now the number of distinct values is very accurate and the frequencies are a
little underestimated.

QUERY PLAN
Aggregate  (cost=118518.96..118958.65 rows=8794 width=18)
  ->  Group  (cost=118518.96..118738.80 rows=87937 width=18)
        ->  Sort  (cost=118518.96..118518.96 rows=87937 width=18)
              ->  Hash Join  (cost=4.65..111297.50 rows=87937 width=18)
                    ->  Seq Scan on fact0 f  (cost=0.00..87693.36 rows=3000036 width=8)
                    ->  Hash  (cost=4.43..4.43 rows=89 width=10)
                          ->  Index Scan using dim0_q1 on dim0 d0  (cost=0.00..4.43 rows=89 width=10)

ELAPSED : 60s

The query statistics are now very accurate (e.g. 89 rows from dim0 and 87937
rows joined) - note that, ironically, the better execution plan is chosen
with the poorer statistical data!

The conclusion here seems to be that 10 quantiles are not quite enough for
accurate distributional data where (large) tables have a few thousand
distinct values. However, 100 quantiles was sufficient to get accurate
statistics.

Further Notes
-------------

Experimentation showed that accurate estimates (+/- 10%) of the number of
distinct values did not begin to appear until about 75 quantiles were used.
On the other hand, reducing the number of distinct entries by a factor of 10,
while keeping the number of rows constant at 3000000, gave rise to accurate
statistics with 10 quantiles.

Given that the distribution used for fact0 is relatively benign (uniform),
the test is hopefully fair (i.e. not constructed specially to fox the
analyzer). However, the most common value for fact0 is non-unique (since all
d0key values have the same frequency) - I am uncertain if this is
significant...

Tests were performed with the 7.2 snapshot of 16 Oct.

regards

Mark
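P.S. The quantile sweep mentioned under "Further Notes" can be repeated with
something like the following (a sketch only, not the exact commands used for
these tests - 75 is simply the statistics target around which the estimate
started to look reasonable here):

-- Raise the per-column statistics target and refresh the statistics.
ALTER TABLE fact0 ALTER d0key SET STATISTICS 75;
ANALYZE fact0;

-- Planner's estimate of the number of distinct d0key values...
SELECT n_distinct
FROM pg_stats
WHERE tablename = 'fact0' AND attname = 'd0key';

-- ...versus the true figure (3000 in this dataset).
SELECT count(DISTINCT d0key) FROM fact0;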