Re: experiments in query optimization - Mailing list pgsql-performance
From | Faheem Mitha |
---|---|
Subject | Re: experiments in query optimization |
Date | |
Msg-id | alpine.DEB.2.00.1003311506300.13883@orwell.homelinux.org Whole thread Raw |
In response to | Re: experiments in query optimization (Robert Haas <robertmhaas@gmail.com>) |
Responses |
Re: experiments in query optimization
Re: experiments in query optimization |
List | pgsql-performance |
[If Kevin Grittner reads this, please fix your email address. I am getting bounces from your email address.] On Tue, 30 Mar 2010, Robert Haas wrote: > On Tue, Mar 30, 2010 at 12:30 PM, Faheem Mitha <faheem@email.unc.edu> wrote: >> Sure, but define sane setting, please. I guess part of the point is that I'm >> trying to keep memory low, and it seems this is not part of the planner's >> priorities. That it, it does not take memory usage into consideration when >> choosing a plan. If that it wrong, let me know, but that is my >> understanding. > > I don't understand quite why you're confused here. We've already > explained to you that the planner will not employ a plan that uses > more than the amount of memory defined by work_mem for each sort or > hash. > Typical settings for work_mem are between 1MB and 64MB. 1GB is enormous. I don't think I am confused. To be clear, when I said "it does not take memory usage into consideration' I was talking about overall memory usage. Let me summarize: The planner will choose the plan with the minimum total cost, with the constraint that the number of memory used for each of certain steps is less than work_mem. In other words with k such steps it can use at most k(plan)*work_mem memory where k(plan) denotes that k is a function of the plan. (I'm assuming here that memory is not shared between the different steps). However, k(plan)*work_mem is not itself bounded. I fail to see how reducing work_mem significantly would help me. This would mean that the current plans I am using would likely be ruled out, and I would be left with plans which, by definition, would have larger cost and so longer run times. The current runtimes are already quite long - for the PED query, the best I can do with work_mem=1 GB is 2 1/2 hrs, and that is after splitting the query into two pieces. I might actually be better off *increasing* the memory, since then the planner would have more flexibility to choose plans where the individual steps might require more memory, but the overall memory sum might be lower. >>>>> You might need to create some indices, too. >>>> >>>> Ok. To what purpose? This query picks up everything from the >>>> tables and the planner does table scans, so conventional wisdom >>>> and indeed my experience, says that indexes are not going to be so >>>> useful. >>> >>> There are situations where scanning the entire table to build up a >>> hash table is more expensive than using an index. Why not test it? >> >> Certainly, but I don't know what you and Robert have in mind, and I'm not >> experienced enough to make an educated guess. I'm open to specific >> suggestions. > > Try creating an index on geno on the columns that are being used for the join. Ok, I'll try that. I guess the cols in question on geno are idlink_id and anno_id. I thought that I already had indexes on them, but no. Maybe I had indexes, but removed them. If I understand the way this works, if you request, say an INNER JOIN, the planner can choose different ways/algorithms to do this, as in http://en.wikipedia.org/wiki/Join_(SQL)#Nested_loops . It may choose a hash join, or an nested loop join or something else, based on cost. If the indexes don't exist that may make the inner loop join more expensive, so tip the balance in favor of using a hash join. However, I have no way to control which option it chooses, short of disabling eg. the hash join option, which is not an option for production usage anyway. Correct? Regards, Faheem.
pgsql-performance by date: