Re: varchar index joins not working? - Mailing list pgsql-performance
From | Adam Gundy |
---|---|
Subject | Re: varchar index joins not working? |
Date | |
Msg-id | 48038E21.50606@starsilk.net |
In response to | Re: varchar index joins not working? (Richard Huxton <dev@archonet.com>) |
Responses | Re: varchar index joins not working?; Re: varchar index joins not working? |
List | pgsql-performance |
Richard Huxton wrote:
> Adam Gundy wrote:
>> On Thu, Apr 10, 2008 at 8:52 AM, Adam Gundy <adam@starsilk.net> wrote:
>>> Richard Huxton wrote:
>>>> How many distinct values do you have in groups.groupid and
>>>> group_access.group_id?
>>>
>>> for the small database (since it shows the same problem):
>>>
>>> group_access: 280/268
>>> groups: 2006/139
>>>
>>> for the large database:
>>>
>>> group_access: same
>>> groups: 1712647/140
>>>
>>> the groupid key is an MD5 hash, so it should be uniformly distributed.
>>> maybe that throws the stats? but, again, char works, varchar doesn't :-(
>>
>> OK, I'm thinking the varchar/char part is not the issue.
>
> Good, because it's not :-)

hmm. unfortunately it did turn out to be (part of) the issue.

I've discovered that mixing char and varchar in a stored procedure does
not coerce the types, and ends up doing seq scans all the time.

changing something like this:

proc x ( y char(32) ) { select * from groups where groupid = y }

into this:

proc x ( y varchar(32) ) { select * from groups where groupid = y }

and suddenly postgres does index lookups in the stored proc... way faster.

>> I *think* that when I analyze using char instead of varchar, it is
>> recording a stat for the large group, but for some reason with varchar
>> doesn't add a stat for that one.
>>
>> so, the real question is, how do I fix this? I can turn the stats way
>> up to 1000, but that doesn't guarantee that I'll get a stat for the
>> large group :-(
>
> Well, by default it will be tracking the 10 most common values (and how
> often they occur). As you say, this can be increased to 1000 (although
> it obviously takes longer to check 1000 rather than 10).
>
> We can have a look at the stats with something like:
> SELECT * FROM pg_stats WHERE tablename='group_access' AND attname='uid';
> You'll be interested in n_distinct, most_common_vals and most_common_freqs.
>
> However, I think the problem may be that PG doesn't track cross-column
> stats, so it doesn't know that a particular uid implies one or more
> particular groupid values.

I doubt we could get stats stable enough for this. the number of groups
will hopefully be much larger at some point.

it's a shame the index entries can't be used to provide information to
the planner, eg a rough count of the number of entries for a given key
(or subset). it would be nice to be able to create eg a counted btree
when you know you have this kind of data, as a hint to the planner.

>> can I turn the statistics off completely for this column? I'm guessing
>> that if I can, that will mean it takes a guess based on the number of
>> distinct values in the groups table, which is still a large number of
>> records, possibly enough to trigger the seqscan anyway.
>
> No - can't disable stats. Besides, you want it the other way around -
> index scans for all groups except the largest.

actually, disabling seqscan at the server level gives extremely good
response times. I ended up rewriting a few queries that were scanning
the whole group for no good reason, and bitmap index hashing seems to
take care of things nicely. queries have gone from 30+ seconds to
< 0.1 seconds.

>> does postgres have a way of building a 'counted index' that the
>> planner can use for its record counts? some way of forcibly
>> maintaining a stat for every group?
>
> No, but let's see what's in pg_stats.

no real help there. either it hits the group being read, and does a good
plan, or it doesn't, and tries to seqscan (unless I disable it).
even forcing stats to 1000 only bandaids the situation, given the number of groups will eventually exceed that..
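For reference, a concrete sketch of the "proc x" change described above, plus
the statistics-target and seqscan settings mentioned later in the thread. The
table and function definitions here are invented for illustration (the real
schema is obviously larger, and the column the stats target should apply to is
a guess based on the thread); exact planner behaviour will also depend on the
PostgreSQL version, but the point is the char(32) vs varchar(32) parameter type:

-- illustrative schema only: groupid is an MD5 hash stored as text
create table groups (
    groupid varchar(32) primary key,
    name    text
);

-- slow: the char(32) parameter doesn't match the varchar(32) column,
-- so the lookup ends up seq scanning on every call
create function get_group_slow(char(32)) returns setof groups as $$
    select * from groups where groupid = $1;
$$ language sql stable;

-- fast: the parameter type matches the column type, so the lookup can
-- use the primary key index
create function get_group_fast(varchar(32)) returns setof groups as $$
    select * from groups where groupid = $1;
$$ language sql stable;

-- the knobs discussed above: raise the per-column statistics target and
-- re-analyze, and/or disable seqscan (shown per session here; it can also
-- be set server-wide in postgresql.conf)
alter table groups alter column groupid set statistics 1000;
analyze groups;
set enable_seqscan = off;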