Need help - optimizer trouble - Mailing list pgsql-novice
From | Helge Bahmann |
---|---|
Subject | Need help - optimizer trouble |
Date | |
Msg-id | Pine.LNX.4.21.0104042351270.18606-100000@lothlorien.stunet2.tu-freiberg.de Whole thread Raw |
Responses |
Re: Need help - optimizer trouble
Re: Need help - optimizer trouble |
List | pgsql-novice |
Hi! I am desperate with 7.1beta4; I have a serious performance problem that does not manifest in 7.0.2. This is currently just a feasability test, so I am willing to try every hack. The problematic part of my schema looks as follows: document written_by author author_keyword -------- 1 n ---------- ---------- -------------- *doc_id <---- *doc_id n 1 *name 1 n *keyword doc *author_id ----> *author_id <---- *author_id columns marked with an asterik have an index document contains ~100 000 tuples; each is "written_by" 1-3 out of ~10 000 authors, and every author is associated with ~4 keywords I am trying to retrieve documents written by authors associated with a given keyword. I can get the doc_ids using: SELECT doc_id FROM written_by AS wb JOIN author_keyword AS kw ON kw.author_id=wb.author_id WHERE kw.keyword='foo' 7.1beta4 and 7.0.2 both use the following query plan: Nested Loop (cost=0.00..18.37 rows=8 width=20) -> Index Scan using author_keyword_pkey on author_keyword kw (cost=0.00..2.02 rows=1 width=4) -> Index Scan using written_by_idx on written_by (cost=0.00..16.29 rows=4 width=16) I can then take the values returned and retrieve the documents in a separate query; the two queries take at most a second. However if I add a join to get the documents directly: SELECT doc FROM document JOIN written_by AS wb ON document.doc_id=wb.doc_id JOIN author_keyword AS kw ON kw.author_id=wb.author_id WHERE kw.keyword='foo' 7.0.2 will use the following query plan, query takes ~1 second: Nested Loop (cost=0.00..27.53 rows=2 width=32) -> Nested Loop (cost=0.00..10.00 rows=4 width=20) -> Index Scan using author_keyword_pkey on author_keyword kw (cost=0.00..2.02 rows=1 width=4) -> Index Scan using written_by_idx on ist_autor_von (cost=0.00..7.95 rows=2 width=16) -> Index Scan using document_pkey on document (cost=0.00..4.16 rows=1 width=12) 7.1beta4 uses the following query plan, query takes ~150 seconds: Nested Loop (cost=8562.09..39846.62 rows=4 width=32) -> Index Scan using author_keyword_pkey on author_keyword kw (cost=0.00..2.02 rows=1 width=4) -> Materialize (cost=37969.52..37969.52 rows=100000 width=28) -> Hash Join (cost=8562.09..37969.52 rows=100000 width=28) -> Seq Scan on document (cost=0.00..11932.00 rows=100000 width=12) -> Hash (cost=5129.55..5129.55 rows=203555 width=16) -> Seq Scan on written_by (cost=0.00..5129.55 rows=303555 width=16) The two databases are not identical, but they were created using the same script which fills the database with random data of the above structure. I recreated the databases several times, so it is completely repeatable. I tried playing with the optimizer parameters, but it only got worse because I do not really understand what they are all doing. I fell uncomfortable with the idea of having to split up my query into two separate ones. Can anyone give me any hints how to influence the 7.1 optimizer to behave like 7.0.2 in this case? Is there something wrong with my query? Is there an alternative form which the optimizer can handle better? Thanks for your patience Helge
pgsql-novice by date: