Too slow querying a table of 15 million records - Mailing list pgsql-performance
| From | kjelle@bingon.no |
|---|---|
| Subject | Too slow querying a table of 15 million records |
| Date | |
| Msg-id | 19310.193.213.27.25.1119378838.squirrel@webmail.s7.itpays.net |
| Responses | Re: Too slow querying a table of 15 million records (three replies) |
| List | pgsql-performance |
Hello! I use FreeBSD 4.11 with PostgreSQL 7.3.8. I have a huge database with roughly 15 million records. There is just one table, with a time field, a few ints and a few strings:

```
table test
fields: time (timestamp), source (string), destination (string),
        p1 (int), p2 (int)
```

I have run VACUUM ANALYZE. I have created indexes on every field, but for some reason my PostgreSQL server wants to use a seq scan, even though I know an index scan would be much faster:

```sql
create index test_time_idx on test (time);
create index test_source_idx on test (source);
create index test_destination_idx on test (destination);
create index test_p1_idx on test (p1);
create index test_p2_idx on test (p2);
```

What is really strange is that when I run a count(*) on a value of the int field p1 that has a very low count, PostgreSQL uses a seq scan. Another count on the same field matches about 2.2 million rows, and there it also suddenly uses a seq scan instead of an indexed one. Isn't the whole idea of indexing to increase performance on large queries, to build a sort of phonebook for the values so it is faster to look up whatever you need? This just seems the opposite.
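As an editorial aside (not part of the original message): none of the five single-column indexes matches the shape of the query shown later, which filters on p1 together with a time range. The usual fix is one composite index covering both columns; this is a sketch under that assumption, and the index name is made up:

```sql
-- Sketch (assumption, not from the thread): a composite index matching
-- "WHERE p1 = ? AND time > ?", letting the planner satisfy both
-- conditions with a single index scan instead of a seq scan
CREATE INDEX test_p1_time_idx ON test (p1, "time");
```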
Here is an EXPLAIN of my query:

```
database=> explain select date_trunc('hour', time), count(*) as total
database->   from test
database->   where p1=53 and time > now() - interval '24 hours'
database->   group by date_trunc order by date_trunc;
                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Aggregate  (cost=727622.61..733143.23 rows=73608 width=8)
   ->  Group  (cost=727622.61..731303.02 rows=736083 width=8)
         ->  Sort  (cost=727622.61..729462.81 rows=736083 width=8)
               Sort Key: date_trunc('hour'::text, "time")
               ->  Seq Scan on test  (cost=0.00..631133.12 rows=736083 width=8)
                     Filter: ((p1 = 53) AND ("time" > (now() - '1 day'::interval)))
(6 rows)
```

And these are my indexes (from tab completion):

```
database=> drop INDEX test_<TAB>
test_source_idx       test_destination_idx  test_p1_idx
test_p2_idx           test_time_idx
```

After all this, I tried setting enable_seqscan to off and enable_nestloop to on. That didn't help much either; the query still takes minutes to run. My result is the number of rows for each hour, about 1000-2000 hits per hour.

I have read somewhere that PostgreSQL can easily handle 100-200 million records, and that with a well-tuned system it performs great. I would like to learn how :)

I also found an article (http://techdocs.postgresql.org/techdocs/pgsqladventuresep3.php):

> Tip #11: Don't bother indexing columns with huge numbers of records and a small range of values, such as BOOLEAN columns.
>
> This tip, regretfully, is perhaps the only tip where I cannot provide a good, real-world example from my work. So I'll give you a hypothetical situation instead: imagine that you have a database table with a list of every establishment vending ice cream in the US. A simple example might look like: [table missing from the archive] Where there were almost 1 million rows, but due to simplistic data entry, only three possible values for type (1-SUPERMARKET, 2-BOUTIQUE, and 3-OTHER) which are relatively evenly distributed.
> In this hypothetical situation, you might find (with testing using EXPLAIN) that an index on type is ignored and the planner uses a "seq scan" (or table scan) instead. This is because a table scan can actually be faster than an index scan in this situation. Thus, any index on type should be dropped. Certainly, the boolean column (active) requires no indexing, as it has only two possible values and no index will be faster than a table scan.

Then I ask: what use is indexing if I can't use it on a VERY large database? On my 15-million-record database it takes forever to do seq scans over and over again. This is probably, as I mentioned earlier (read the quote), the reason it chooses a full scan and not an indexed one.

So what do I do? I've used SQL for years, but never at such a scale, so I never had to learn how to deal with large numbers of records; usually it was a maximum of 1000. Now, with millions, I need to learn a way to make my slow queries better. I'm trying to learn more about tuning my system and writing better queries. I've found some documents on the Internet, but they are far from the best. Feedback most appreciated!

Regards,
a learning PostgreSQL user
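Editorial note, not part of the original message: on 7.3, whether the planner picks an index scan over a seq scan is driven by its cost settings and row estimates, so the usual first steps after ANALYZE are adjusting random_page_cost and effective_cache_size and then comparing actual timings with EXPLAIN ANALYZE. A hedged sketch with illustrative values, not tuned recommendations:

```sql
-- Sketch; both settings exist in PostgreSQL 7.3, values are illustrative
SET random_page_cost = 2;           -- default 4; lower values favour index scans
SET effective_cache_size = 50000;   -- measured in 8 kB disk pages

-- EXPLAIN ANALYZE shows actual run times next to the planner's estimates
EXPLAIN ANALYZE
SELECT date_trunc('hour', "time"), count(*) AS total
FROM test
WHERE p1 = 53 AND "time" > now() - interval '24 hours'
GROUP BY 1
ORDER BY 1;
```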