Understanding histograms - Mailing list pgsql-performance
From | Len Shapiro |
---|---|
Subject | Understanding histograms |
Date | |
Msg-id | 4817FC00.50609@cs.pdx.edu Whole thread Raw |
Responses |
Re: Understanding histograms
|
List | pgsql-performance |
I hope I am posting to the right list. I am running Postgresql 8.1.9 and don't understand the behavior of histograms for data items not in the MVC list. I teach databases and want to use Postgres as an example. I will appreciate any help that anyone can provide. Here is the data I am using. I am interested only in the "rank" attribute. CREATE TABLE Sailors ( sid Integer NOT NULL, sname varchar(20), rank integer, age real, PRIMARY KEY (sid)); I insert 30 sailor rows: INSERT INTO Sailors VALUES (3, 'Andrew', 10, 30.0); INSERT INTO Sailors VALUES (17, 'Bart', 5, 30.2); INSERT INTO Sailors VALUES (29, 'Beth', 3, 30.4); INSERT INTO Sailors VALUES (28, 'Bryant', 3, 30.6); INSERT INTO Sailors VALUES (4, 'Cynthia', 9, 30.8); INSERT INTO Sailors VALUES (16, 'David', 9, 30.9); INSERT INTO Sailors VALUES (27, 'Fei', 3, 31.0); INSERT INTO Sailors VALUES (12, 'James', 3, 32.0); INSERT INTO Sailors VALUES (30, 'Janice', 3, 33.0); INSERT INTO Sailors VALUES (2, 'Jim', 8, 34.5); INSERT INTO Sailors VALUES (15, 'Jingke', 10, 35.0); INSERT INTO Sailors VALUES (26, 'Jonathan',9, 36.0); INSERT INTO Sailors VALUES (24, 'Kal', 3, 36.6); INSERT INTO Sailors VALUES (14, 'Karen', 8, 37.8); INSERT INTO Sailors VALUES (8, 'Karla',7, 39.0); INSERT INTO Sailors VALUES (25, 'Kristen', 10, 39.5); INSERT INTO Sailors VALUES (19, 'Len', 8, 40.0); INSERT INTO Sailors VALUES (7, 'Lois', 8, 41.0); INSERT INTO Sailors VALUES (13, 'Mark', 7, 43.0); INSERT INTO Sailors VALUES (18, 'Melanie', 1, 44.0); INSERT INTO Sailors VALUES (5, 'Niru', 5, 46.0); INSERT INTO Sailors VALUES (23, 'Pavel', 3, 48.0); INSERT INTO Sailors VALUES (1, 'Sergio', 7, 50.0); INSERT INTO Sailors VALUES (6, 'Suhui', 1, 51.0); INSERT INTO Sailors VALUES (22, 'Suresh',9, 52.0); INSERT INTO Sailors VALUES (20, 'Tim',7, 54.0); INSERT INTO Sailors VALUES (21, 'Tom', 10, 56.0); INSERT INTO Sailors VALUES (11, 'Warren', 3, 58.0); INSERT INTO Sailors VALUES (10, 'WuChang',9, 59.0); INSERT INTO Sailors VALUES (9, 'WuChi', 10, 60.0); after analyzing, I access the pg_stats table with SELECT n_distinct, most_common_vals, most_common_freqs, histogram_bounds FROM pg_stats WHERE tablename = 'sailors' AND attname = 'rank'; and I get: n_distinct most_common_vals most_common_freqs histogram_bounds -0.233333 {3,9,10,7,8} {0.266667,0.166667,0.166667,0.133333,0.133333} {1,5} I have two questions. I'd appreciate any info you can provide, including pointers to the source code. 1. Why does Postgres come up with a negative n_distinct? It apparently thinks that the number of rank values will increase as the number of sailors increases. What/where is the algorithm that decides that? 2. The most_common_vals and their frequencies make sense. They say that the values {3,9,10,7,8} occur a total of 26 times, so other values occur a total of 4 times. The other, less common, values are 1 and 5, each occuring twice, so the histogram {1,5} is appropriate. If I run the query EXPLAIN SELECT * from sailors where rank = const; for any const not in the MVC list, I get the plan Seq Scan on sailors (cost=0.00..1.38 rows=2 width=21) Filter: (rank = const) The "rows=2" estimate makes sense when const = 1 or 5, but it makes no sense to me for other values of const not in the MVC list. For example, if I run the query EXPLAIN SELECT * from sailors where rank = -1000; Postgres still gives an estimate of "row=2". Can someone please explain? Thanks, Len Shapiro Portland State University
pgsql-performance by date: