Re: large tables and simple "= constant" queries using indexes - Mailing list pgsql-performance
From | PFC
Subject | Re: large tables and simple "= constant" queries using indexes
Date |
Msg-id | op.t9dy1aqwcigqcu@apollo13.peufeu.com
In response to | Re: large tables and simple "= constant" queries using indexes (Arjen van der Meijden <acmmailing@tweakers.net>)
Responses | Re: large tables and simple "= constant" queries using indexes
List | pgsql-performance
>> Perfect - thanks Arjen. Using your value of 200 decreased the time to
>> 15 seconds, and using a value of 800 makes it almost instantaneous. I'm
>> really not concerned about space usage; if having more statistics
>> increases performance this much, maybe I'll just default it to 1000?
>> Strangely, the steps taken in the explain analyze are all the same.
>> The only differences are the predicted costs (and execution times).
>> explain analyze for a statistics of 200:

Actually, since you got the exact same plans and the second one is a lot
faster, this can mean that the data is in the disk cache, or that the
second query has all the rows it needs contiguous on disk, whereas the
first one has its rows scattered all over the place. Therefore you are
I/O-bound.

Statistics helped, perhaps (impossible to know, since you don't provide
the plan with statistics set to 10), but your main problem is I/O.
Usually setting the statistics target to 100 is enough...

Now, here are some solutions to your problem, in random order (rough SQL
sketches for several of them follow at the end of this message):

- Install 64 bit Linux, 64 bit Postgres, and get lots of RAM, lol.

- Switch to RAID10 (4 times the IOs per second; zero gain if you're
single-threaded, but a massive gain under concurrent load).

- If you just need a count by gene_ref, a simple solution is to keep it
in a separate table and update it via triggers. This is a frequently
used solution; it works well unless gene_ref is updated all the time
(which is probably not your case). Since you will be vacuuming this
count-cache table often, don't put the count as a field in your
sgd_annotations table; just create a small table with 2 fields, gene_ref
and count (unless you want to use the count for other things and you
don't like the join).

From your table definition, gene_ref references another table. It would
seem that you have many rows in gene_prediction_view with the same
gene_ref value.

- If you often query rows with the same gene_ref, consider using CLUSTER
to physically group those rows on disk. That way you can get all rows
with the same gene_ref in 1 seek instead of 2000. Clustered tables also
make bitmap scans happy. This one is good since it can also speed up
other queries (not just the count). You could perhaps also cluster on
(gene_ref, go_id); I don't know what your columns mean. Only you can
decide that, because the clustering order has to be meaningful (it
should group rows according to something that makes sense, not at
random).

* Lose some weight:

CREATE INDEX ix_gene_prediction_view_gene_ref
    ON gene_prediction_view USING btree (gene_ref);

- This index is useless, since you have a UNIQUE constraint on
(gene_ref, go_id), which is also an index. Remove the index on
(gene_ref); it will leave space in the disk cache for other things.

- Since (gene_ref, go_id) is UNIQUE NOT NULL, you might be able to use
that as your primary key, but only if it is never updated, of course.
Saves another index.

- If you often run queries that fetch many rows but seldom fetch the
description, tell PG to always store the description in out-of-line
compressed form (read the docs on ALTER TABLE ... SET STORAGE ..., I
forgot the syntax). The point is to make the main table smaller.

- Also, I see a category stored as VARCHAR. If you have a million
different categories, that's OK, but if you have 100 categories for your
15M rows, put them in a separate table and replace the column with a
category_id (normalize!).
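Rough SQL sketches for the suggestions above follow. Table and column
names are taken from the thread where the message mentions them and are
guesses otherwise; adjust to your schema.

First, the per-column statistics target discussed at the top of the
thread is set like this (a minimal sketch; gene_ref is assumed to be the
skewed column being queried):

-- raise the planner's statistics target for the queried column,
-- then re-analyze so the larger histogram is actually collected
ALTER TABLE gene_prediction_view
    ALTER COLUMN gene_ref SET STATISTICS 200;
ANALYZE gene_prediction_view;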
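A minimal sketch of the trigger-maintained count-cache table. The table,
column, and function names are mine; the trigger handles only INSERT and
DELETE (fine if gene_ref is never updated, as assumed above), and two
concurrent first inserts for the same gene_ref could still race, so
treat this as a starting point rather than a finished implementation:

-- one row per gene_ref, kept up to date by triggers
CREATE TABLE gene_ref_counts (
    gene_ref integer PRIMARY KEY,
    cnt      bigint  NOT NULL DEFAULT 0
);

-- seed it from the existing data
INSERT INTO gene_ref_counts (gene_ref, cnt)
    SELECT gene_ref, count(*)
    FROM gene_prediction_view
    GROUP BY gene_ref;

CREATE OR REPLACE FUNCTION maintain_gene_ref_count() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE gene_ref_counts SET cnt = cnt + 1
            WHERE gene_ref = NEW.gene_ref;
        IF NOT FOUND THEN
            -- first row for this gene_ref
            INSERT INTO gene_ref_counts (gene_ref, cnt)
                VALUES (NEW.gene_ref, 1);
        END IF;
        RETURN NEW;
    ELSE  -- DELETE
        UPDATE gene_ref_counts SET cnt = cnt - 1
            WHERE gene_ref = OLD.gene_ref;
        RETURN OLD;
    END IF;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_gene_ref_count
    AFTER INSERT OR DELETE ON gene_prediction_view
    FOR EACH ROW EXECUTE PROCEDURE maintain_gene_ref_count();

The expensive aggregate then becomes a single-row lookup:

SELECT cnt FROM gene_ref_counts WHERE gene_ref = 12345;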
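Clustering on the existing unique index might look like this. The index
name here is a guess (use \d gene_prediction_view to find the real one),
and remember that CLUSTER rewrites the whole table under an exclusive
lock, so run it in a quiet window:

-- physically reorder the table along the (gene_ref, go_id) index;
-- on 8.3+ the syntax is CLUSTER table USING index
-- (older releases use: CLUSTER index_name ON table_name)
CLUSTER gene_prediction_view
    USING gene_prediction_view_gene_ref_go_id_key;
ANALYZE gene_prediction_view;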
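Dropping the redundant single-column index is a one-liner; promoting the
unique pair to the primary key is sketched as well, but only works after
you drop whatever surrogate key currently holds that role, so take this
part as illustrative only:

-- the UNIQUE (gene_ref, go_id) index already covers gene_ref lookups
DROP INDEX ix_gene_prediction_view_gene_ref;

-- optional: make the unique pair the primary key
-- (first drop the existing primary key, if there is one)
ALTER TABLE gene_prediction_view
    ADD PRIMARY KEY (gene_ref, go_id);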
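The storage syntax the message couldn't recall is ALTER TABLE ... ALTER
COLUMN ... SET STORAGE. EXTENDED means "compress, and move out of line
into TOAST when the row gets large"; it is usually already the default
for text/varchar columns, and TOASTing only kicks in for wide rows, so
check the current setting before expecting a change:

ALTER TABLE gene_prediction_view
    ALTER COLUMN description SET STORAGE EXTENDED;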
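Finally, a sketch of the normalization suggested in the last point. The
column name category and the lookup-table name are assumptions:

-- lookup table with one row per distinct category
CREATE TABLE categories (
    category_id serial  PRIMARY KEY,
    name        varchar NOT NULL UNIQUE
);

INSERT INTO categories (name)
    SELECT DISTINCT category FROM gene_prediction_view;

-- replace the repeated string with a small integer key
ALTER TABLE gene_prediction_view ADD COLUMN category_id integer;

UPDATE gene_prediction_view g
    SET category_id = c.category_id
    FROM categories c
    WHERE c.name = g.category;

ALTER TABLE gene_prediction_view
    ALTER COLUMN category_id SET NOT NULL,
    ADD FOREIGN KEY (category_id) REFERENCES categories;

ALTER TABLE gene_prediction_view DROP COLUMN category;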