Re: [HACKERS] Cost model for parallel CREATE INDEX - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: [HACKERS] Cost model for parallel CREATE INDEX |
Date | |
Msg-id | CAH2-Wz=zyvS=cV__bNk0RgwcZMTuB7VBc3y0J7sR=hYaKGZZPw@mail.gmail.com Whole thread Raw |
In response to | Re: [HACKERS] Cost model for parallel CREATE INDEX (Robert Haas <robertmhaas@gmail.com>) |
Responses |
Re: [HACKERS] Cost model for parallel CREATE INDEX
|
List | pgsql-hackers |
On Thu, Mar 2, 2017 at 5:50 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Mar 1, 2017 at 12:58 AM, Peter Geoghegan <pg@bowt.ie> wrote: >> * This scales based on output size (projected index size), not input >> size (heap scan input). Apparently, that's what we always do right >> now. > > Actually, I'm not aware of any precedent for that. I'd just pass the > heap size to compute_parallel_workers(), leaving the index size as 0, > and call it good. What you're doing now seems exactly backwards from > parallel query generally. Sorry, that's what I meant. >> So, the main factor that >> discourages parallel sequential scans doesn't really exist for >> parallel CREATE INDEX. > > Agreed. I'm glad. This justifies the lack of much of any "veto" on the logarithmic scaling. The only thing that can do that is max_parallel_workers_maintenance, the storage parameter parallel_workers (maybe this isn't a storage parameter in V9), and insufficient maintenance_work_mem per worker (as judged by min_parallel_relation_size being greater than workMem per worker). I guess that the workMem scaling threshold thing could be min_parallel_index_scan_size, rather than min_parallel_relation_size (which we now call min_parallel_table_scan_size)? In general, I would expect this to leave most CREATE INDEX statements with a parallel plan in the real world, using exactly the number of workers indicated by the logarithmic scaling. (pg_restore would also not use parallelism, because it's specially disabled -- you have to have set the storage param at some point.) >> We could always defer the cost model to another release, and only >> support the storage parameter for now, though that has disadvantages, >> some less obvious [4]. > > I think it's totally counter-intuitive that any hypothetical index > storage parameter would affect the degree of parallelism involved in > creating the index and also the degree of parallelism involved in > scanning it. Whether or not other systems do such crazy things seems > to me to beside the point. I think if CREATE INDEX allows an explicit > specification of the degree of parallelism (a decision I would favor) > it should have a syntactically separate place for unsaved build > options vs. persistent storage parameters. I can see both sides of it. On the one hand, it's weird that you might have query performance adversely affected by what you thought was a storage parameter that only affected the index build. On the other hand, it's useful that you retain that as a parameter, because you may want to periodically REINDEX, or have a way of ensuring that pg_restore does go on to use parallelism, since it generally won't otherwise. (As mentioned already, pg_restore does not trust the cost model due to issues with the availability of statistics). There are reports on Google of users of these other systems being confused by all this, and I don't think that it's any different there (those other systems don't treat parallel_workers style storage parameter much different for the purposes of index scans, or anything like that). I agree that that isn't very user friendly. In theory, having two index storage parameters solves our problem. I don't like that either, though, since it creates a whole new problem. To be clear, I don't have any strong feelings on all this. I just think it's worth pointing out that there are reasons to not do what you suggest, that you might want to consider if you haven't already. -- Peter Geoghegan
pgsql-hackers by date: