Thread: index-only scans
Please find attached a patch implementing a basic version of index-only scans. This patch is the work of my colleague Ibrar Ahmed and myself, and also incorporates some code from previous patches posted by Heikki Linnakangas.

I'm able to demonstrate a significant performance benefit from this patch, even in a situation where it doesn't save any disk I/O, due to reduced thrashing of shared_buffers. Configuration settings:

max_connections = 100
shared_buffers = 400MB
maintenance_work_mem = 256MB
synchronous_commit = off
checkpoint_segments = 100
checkpoint_timeout = 30min
checkpoint_completion_target = 0.9
checkpoint_warning = 90s
seq_page_cost = 1.0
random_page_cost = 1.0
effective_cache_size = 3GB

Test setup:

pgbench -i -s 50
create table sample_data as select (random()*5000000)::int as aid,
  repeat('a', 1000) as filler from generate_series(1,100000);

Test queries:

select sum(aid) from sample_data a1 where exists (select * from
  pgbench_accounts a where a.aid = a1.aid and a.aid != 1234567);
select sum(aid) from sample_data a1 where exists (select * from
  pgbench_accounts a where a.aid = a1.aid and a.bid != 1234567);

On my laptop, the first query executes in about 555 ms, while the second one takes about 1125 ms. Inspection via pg_buffercache reveals that the second one thrashes shared_buffers something fierce, while the first one does not. You can actually get the time for the first query down to about 450 ms if you can persuade PostgreSQL to cache the entire sample_data table - which is difficult, due to the BufferAccessStrategy stuff - and as soon as you run the second query, it blows the table out of cache, so practically speaking you're not going to get that faster time very often. I expect that you could get an even larger benefit from this type of query if you could avoid actual disk I/O, rather than just buffer cache thrashing, but I haven't come up with a suitable test case for that yet (volunteers?).

There are several things about this patch that probably need further thought and work, or at least some discussion.

1. The way that nodeIndexscan.c builds up the faux heap tuple is perhaps susceptible to improvement. I thought about building a virtual tuple, but then what do I do with an OID column, if I have one? Or maybe this should be done some other way altogether.

2. Suppose we scan one tuple on a not-all-visible page followed by 99 tuples on all-visible pages. The code as written will hold the pin on the first heap page for the entire scan. As soon as we hit the end of the scan or another tuple where we have to actually visit the page, the old pin will be released, but until then we hold onto it. This isn't totally stupid: the next tuple that's on a not-all-visible page could easily be on the same not-all-visible page we already have pinned. And in 99% of cases holding the pin for slightly longer is probably completely harmless. On the flip side, it could increase the chances of interfering with VACUUM. Then again, a non-index-only scan would still interfere with the same VACUUM, so maybe we don't care.

3. The code in create_index_path() builds up a bitmapset of heap attributes that get used for any purpose anywhere in the query, and hangs it on the RelOptInfo so it doesn't need to be rebuilt for every index under consideration. However, if it were somehow possible to have the rel involved without using any attributes at all, we'd rebuild the cache over and over, since it would never become non-NULL.
I don't think that can happen right now, but future planner changes might make it possible.

4. There are a couple of cases that use index-only scans even though the EXPLAIN output sort of makes it look like they shouldn't. For example, in the above queries, an index-only scan is chosen even though the query does "SELECT *" from the table being scanned. Even though the EXPLAIN (VERBOSE) output makes it look otherwise, it seems that the target list of an EXISTS query is in fact discarded, e.g.:

create or replace function error() returns int as $$begin select 1/0;
end$$ language plpgsql;
select * from pgbench_accounts a where exists (select error() from
  pgbench_branches b where b.bid = a.aid); -- doesn't error out!

Along similar lines, COUNT(*) does not preclude an index-only scan, because the * is apparently just window dressing. You'll still get just a seq-scan unless you have an indexable qual in the query somewhere, because...

5. We haven't made any planner changes at all, not even for cost estimation. It is not clear to me what the right way to do cost estimation here is. It seems that it would be useful to know what percentage of the table is all-visible at planning time, but even if we did, there could (and probably often will) be correlation between the pages the query is interested in and which visibility map bits are set. So I'm concerned about being overly optimistic about the cost savings. Also, there's the problem of figuring out how to keep the percent-of-table-that-has-the-vm-bits-set statistic up to date. Maybe that's a job for ANALYZE but I haven't thought about it much yet.

6. I'm sure there are probably some statements in the documentation that need updating, but I haven't tracked 'em down yet.

Comments, testing, review appreciated...

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
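The buffer-cache inspection mentioned above can be reproduced with a query along these lines (a rough sketch, assuming the pg_buffercache contrib module is installed in the test database; the relation names come from the test setup above):

SELECT c.relname, count(*) AS buffers
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
WHERE b.reldatabase IN (0, (SELECT oid FROM pg_database
                            WHERE datname = current_database()))
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;

Running it after each test query shows how much of shared_buffers each relation occupies, which is how the thrashing described above becomes visible.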
> 1. The way that nodeIndexscan.c builds up the faux heap tuple is > perhaps susceptible to improvement. I thought about building a > virtual tuple, but then what do I do with an OID column, if I have > one? Or maybe this should be done some other way altogether. Maybe it's time to finally remove the been-deprecated-for-a-while OIDs? -- Greg Sabino Mullane greg@turnstep.com End Point Corporation http://www.endpoint.com/
On 08/11/2011 01:57 PM, Greg Sabino Mullane wrote: >> 1. The way that nodeIndexscan.c builds up the faux heap tuple is >> perhaps susceptible to improvement. I thought about building a >> virtual tuple, but then what do I do with an OID column, if I have >> one? Or maybe this should be done some other way altogether. > > Maybe it's time to finally remove the been-deprecated-for-a-while OIDs? As a user-space option, certainly. However, they are also used to track system objects; I don't know that we need to get rid of them for that purpose. We just need to remove WITH/WITHOUT OIDS for user tables. JD -- Command Prompt, Inc. - http://www.commandprompt.com/ PostgreSQL Support, Training, Professional Services and Development The PostgreSQL Conference - http://www.postgresqlconference.org/ @cmdpromptinc - @postgresconf - 509-416-6579
On Thu, Aug 11, 2011 at 4:57 PM, Greg Sabino Mullane <greg@turnstep.com> wrote: >> 1. The way that nodeIndexscan.c builds up the faux heap tuple is >> perhaps susceptible to improvement. I thought about building a >> virtual tuple, but then what do I do with an OID column, if I have >> one? Or maybe this should be done some other way altogether. > > Maybe it's time to finally remove the been-deprecated-for-a-while OIDs? I thought about just not supporting that for index-only scans, but system catalogs use them pretty extensively, and it doesn't seem out of the question that that could matter to people who have lots of SQL objects floating around. I am *definitely* not volunteering to reengineer OIDs out of the system catalogs. For that amount of work, I could probably implement half a dozen major features, any single one of which would be more valuable than the tiny amount of notational convenience such a change would buy. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
2011/8/11 Robert Haas <robertmhaas@gmail.com>: > Please find attached a patch implementing a basic version of > index-only scans. This patch is the work of my colleague Ibrar Ahmed > and myself, and also incorporates some code from previous patches > posted by Heikki Linnakanagas. Great!. > > I'm able to demonstrate a significant performance benefit from this > patch, even in a situation where it doesn't save any disk I/O, due to > reduced thrashing of shared_buffers. Configuration settings: > > max_connections = 100 > shared_buffers = 400MB > maintenance_work_mem = 256MB > synchronous_commit = off > checkpoint_segments = 100 > checkpoint_timeout = 30min > checkpoint_completion_target = 0.9 > checkpoint_warning = 90s > seq_page_cost = 1.0 > random_page_cost = 1.0 > effective_cache_size = 3GB > > Test setup: > > pgbench -i -s 50 > create table sample_data as select (random()*5000000)::int as aid, > repeat('a', 1000) as filler from generate_series(1,100000); > > Test queries: > > select sum(aid) from sample_data a1 where exists (select * from > pgbench_accounts a where a.aid = a1.aid and a.aid != 1234567); > select sum(aid) from sample_data a1 where exists (select * from > pgbench_accounts a where a.aid = a1.aid and a.bid != 1234567); > > On my laptop, the first query executes in about 555 ms, while the > second one takes about 1125 ms. Inspection via pg_buffercache reveals > that the second one thrashes shared_buffers something fierce, while > the first one does not. You can actually get the time for the first > query down to about 450 ms if you can persuade PostgreSQL to cache the > entire sample_data table - which is difficult, due the > BufferAccessStrategy stuff - and as soon as you run the second query, > it blows the table out of cache, so practically speaking you're not > going to get that faster time very often. I expect that you could get > an even larger benefit from this type of query if you could avoid > actual disk I/O, rather than just buffer cache thrashing, but I > haven't come up with a suitable test cases for that yet (volunteers?). > > There are several things about this patch that probably need further > thought and work, or at least some discussion. > > 1. The way that nodeIndexscan.c builds up the faux heap tuple is > perhaps susceptible to improvement. I thought about building a > virtual tuple, but then what do I do with an OID column, if I have > one? Or maybe this should be done some other way altogether. Can this faux heap tuple be appended by the data from another index once it has been created ? ( if the query involves those 2 index) -- Cédric Villemain +33 (0)6 20 30 22 52 http://2ndQuadrant.fr/ PostgreSQL: Support 24x7 - Développement, Expertise et Formation
>> Maybe it's time to finally remove the been-deprecated-for-a-while OIDs? > I thought about just not supporting that for index-only scans, but > system catalogs use them pretty extensively, and it doesn't seem out > of the question that that could matter to people who have lots of SQL > objects floating around. Right - when I said remove, I meant for all but system catalogs. I would think those are generally small enough that for most people the lack of index-only scans on those would not matter. Heck, the system catalogs are already "special" in lots of ways other than having OIDs (most anyway), so it's not as though we'd be breaking sacred ground with an index-only exception. :) I guess the question that should be asked is "we are going to finally remove OIDs someday, right?". If so, and if it's potentially blocking a major new feature, why not now? -- Greg Sabino Mullane greg@turnstep.com End Point Corporation http://www.endpoint.com/
On 08/11/2011 09:44 PM, Greg Sabino Mullane wrote: > > I guess the question that should be asked is "we are going to finally > remove OIDs someday, right?". If so, and if it's potentially blocking a > major new feature, why not now? > > It seems a bit odd then that we added "ALTER TABLE SET WITH OIDS" just a couple of years ago. cheers andrew
On Thu, Aug 11, 2011 at 5:39 PM, Cédric Villemain <cedric.villemain.debian@gmail.com> wrote: > 2011/8/11 Robert Haas <robertmhaas@gmail.com>: >> Please find attached a patch implementing a basic version of >> index-only scans. This patch is the work of my colleague Ibrar Ahmed >> and myself, and also incorporates some code from previous patches >> posted by Heikki Linnakanagas. > > Great! Glad you like it...! > Can this faux heap tuple be appended by the data from another index > once it has been created ? ( if the query involves those 2 index) I don't see how to make that work. In general, a query like "SELECT a, b FROM foo WHERE a = 1 AND b = 1" can only use both indexes if we use a bitmap index scan on each followed by a bitmapand and then a bitmap heap scan. However, this patch only touches the index-scan path, which only knows how to use one index for any given query. Actually, I can see a possible way to allow an index-only type optimization to be used for bitmap scans. As you scan the index, any tuples that can be handled index-only get returned immediately; the rest are thrown into a bitmap. Once you're done examining the index, you then do a bitmap heap scan to get the tuples that couldn't be handled index-only. This seems like it might be our best hope for a "fast count(*)" type optimization, especially if you could combine it with some method of scanning the index in physical order rather than logical order. But even that trick would only work with a single index. I'm not sure there's any good way to assemble pieces of tuples from two different indexes, at least not efficiently. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
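A small sketch of the distinction being drawn here (the table and index names are made up for illustration): with two single-column indexes, a query that needs both columns can only combine them through a bitmap plan, whereas a single two-column index already carries both values and is therefore a candidate for the index-only path:

CREATE TABLE foo (a int, b int);
CREATE INDEX foo_a ON foo (a);
CREATE INDEX foo_b ON foo (b);
-- SELECT a, b FROM foo WHERE a = 1 AND b = 1 can use foo_a and foo_b
-- together only via BitmapAnd plus a bitmap heap scan
CREATE INDEX foo_ab ON foo (a, b);
-- foo_ab supplies both columns from the index alone, so no assembly
-- of tuple pieces across two indexes is needed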
On Thu, Aug 11, 2011 at 9:44 PM, Greg Sabino Mullane <greg@turnstep.com> wrote: >>> Maybe it's time to finally remove the been-deprecated-for-a-while OIDs? > >> I thought about just not supporting that for index-only scans, but >> system catalogs use them pretty extensively, and it doesn't seem out >> of the question that that could matter to people who have lots of SQL >> objects floating around. > > Right - when I said remove, I meant for all but system catalogs. I > would think those are generally small enough that for most people > the lack of index-only scans on those would not matter. Heck, the > system catalogs are already "special" in lots of ways other than > having OIDs (most anyway), so it's not as though we'd be breaking > sacred ground with an index-only exception. :) Yeah, maybe. But since there's no evidence that we actually need that exception for performance, I'm not in favor of adding it at this point. > I guess the question that should be asked is "we are going to finally > remove OIDs someday, right?". I don't necessarily see a reason to do that. I wouldn't object to turning the system OID columns in the system catalogs into normal OID columns, but that would be a lot of work and it doesn't seem to me to be the most important problem we need to solve (or even in the top 25). > If so, and if it's potentially blocking a > major new feature, why not now? It's not blocking a major new feature, except to the extent that we're having a conversation about it, instead of talking about the major new feature. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
2011/8/12 Robert Haas <robertmhaas@gmail.com>: > On Thu, Aug 11, 2011 at 5:39 PM, Cédric Villemain > <cedric.villemain.debian@gmail.com> wrote: >> 2011/8/11 Robert Haas <robertmhaas@gmail.com>: >>> Please find attached a patch implementing a basic version of >>> index-only scans. This patch is the work of my colleague Ibrar Ahmed >>> and myself, and also incorporates some code from previous patches >>> posted by Heikki Linnakanagas. >> >> Great! > > Glad you like it...! > >> Can this faux heap tuple be appended by the data from another index >> once it has been created ? ( if the query involves those 2 index) > > I don't see how to make that work. In general, a query like "SELECT > a, b FROM foo WHERE a = 1 AND b = 1" can only use both indexes if we > use a bitmap index scan on each followed by a bitmapand and then a > bitmap heap scan. However, this patch only touches the index-scan > path, which only knows how to use one index for any given query. I thought of something like that: 'select a,b from foo where a=1 order by b limit 100' (or: where a=1 and b< now() ) > > Actually, I can see a possible way to allow an index-only type > optimization to be used for bitmap scans. As you scan the index, any > tuples that can be handled index-only get returned immediately; the > rest are thrown into a bitmap. Once you're done examining the index, > you then do a bitmap heap scan to get the tuples that couldn't be > handled index-only. This seems like it might be our best hope for a > "fast count(*)" type optimization, especially if you could combine it > with some method of scanning the index in physical order rather than > logical order. IIRC we expose some ideas around that, yes. (optimizing bitmap) Maybe a question that will explain me more about the feature limitation (if any): Does an index-only scan used when the table has no vismap set will cost (in duration, IO, ...) more than a normal Index scan ? > > But even that trick would only work with a single index. I'm not sure > there's any good way to assemble pieces of tuples from two different > indexes, at least not efficiently. okay. > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company > -- Cédric Villemain +33 (0)6 20 30 22 52 http://2ndQuadrant.fr/ PostgreSQL: Support 24x7 - Développement, Expertise et Formation
Robert, I imagine we store positional information in gin index and return tuples in relevant order - instant full-text search ! Great work, guys ! Oleg On Thu, 11 Aug 2011, Robert Haas wrote: > Please find attached a patch implementing a basic version of > index-only scans. This patch is the work of my colleague Ibrar Ahmed > and myself, and also incorporates some code from previous patches > posted by Heikki Linnakanagas. > > I'm able to demonstrate a significant performance benefit from this > patch, even in a situation where it doesn't save any disk I/O, due to > reduced thrashing of shared_buffers. Configuration settings: > > max_connections = 100 > shared_buffers = 400MB > maintenance_work_mem = 256MB > synchronous_commit = off > checkpoint_segments = 100 > checkpoint_timeout = 30min > checkpoint_completion_target = 0.9 > checkpoint_warning = 90s > seq_page_cost = 1.0 > random_page_cost = 1.0 > effective_cache_size = 3GB > > Test setup: > > pgbench -i -s 50 > create table sample_data as select (random()*5000000)::int as aid, > repeat('a', 1000) as filler from generate_series(1,100000); > > Test queries: > > select sum(aid) from sample_data a1 where exists (select * from > pgbench_accounts a where a.aid = a1.aid and a.aid != 1234567); > select sum(aid) from sample_data a1 where exists (select * from > pgbench_accounts a where a.aid = a1.aid and a.bid != 1234567); > > On my laptop, the first query executes in about 555 ms, while the > second one takes about 1125 ms. Inspection via pg_buffercache reveals > that the second one thrashes shared_buffers something fierce, while > the first one does not. You can actually get the time for the first > query down to about 450 ms if you can persuade PostgreSQL to cache the > entire sample_data table - which is difficult, due the > BufferAccessStrategy stuff - and as soon as you run the second query, > it blows the table out of cache, so practically speaking you're not > going to get that faster time very often. I expect that you could get > an even larger benefit from this type of query if you could avoid > actual disk I/O, rather than just buffer cache thrashing, but I > haven't come up with a suitable test cases for that yet (volunteers?). > > There are several things about this patch that probably need further > thought and work, or at least some discussion. > > 1. The way that nodeIndexscan.c builds up the faux heap tuple is > perhaps susceptible to improvement. I thought about building a > virtual tuple, but then what do I do with an OID column, if I have > one? Or maybe this should be done some other way altogether. > > 2. Suppose we scan one tuple on a not-all-visible page followed by 99 > tuples on all-visible pages. The code as written will hold the pin on > the first heap page for the entire scan. As soon as we hit the end of > the scan or another tuple where we have to actually visit the page, > the old pin will be released, but until then we hold onto it. This > isn't totally stupid: the next tuple that's on a not-all-visible page > could easily be on the same not-all-visible page we already have > pinned. And in 99% cases holding the pin for slightly longer is > probably completely harmless. On the flip side, it could increase the > chances of interfering with VACUUM. Then again, a non-index-only scan > would still interfere with the same VACUUM, so maybe we don't care. > > 3. 
The code in create_index_path() builds up a bitmapset of heap > attributes that get used for any purpose anywhere in the query, and > hangs it on the RelOptInfo so it doesn't need to be rebuilt for every > index under consideration. However, if it were somehow possible to > have the rel involved without using any attributes at all, we'd > rebuild the cache over and over, since it would never become non-NULL. > I don't think that can happen right now, but future planner changes > might make it possible. > > 4. There are a couple of cases that use index-only scans even though > the EXPLAIN output sort of makes it look like they shouldn't. For > example, in the above queries, an index-only scan is chosen even > though the query does "SELECT *" from the table being scanned. Even > though the EXPLAIN (VERBOSE) output makes it look otherwise, it seems > that the target list of an EXISTS query is in fact discarded, e.g.: > > create or replace function error() returns int as $$begin select 1/0; > end$$ language plpgsql; > select * from pgbench_accounts a where exists (select error() from > pgbench_branches b where b.bid = a.aid); -- doesn't error out! > > Along similar lines, COUNT(*) does not preclude an index-only scan, > because the * is apparently just window dressing. You'll still get > just a seq-scan unless you have an indexable qual in the query > somewhere, because... > > 5. We haven't made any planner changes at all, not even for cost > estimation. It is not clear to me what the right way to do cost > estimation here is. It seems that it would be useful to know what > percentage of the table is all-visible at planning time, but even if > we did, there could (and probably often will) be correlation between > the pages the query is interested in and which visibility map bits are > set. So I'm concerned about being overly optimistic about the cost > savings. Also, there's the problem of figuring out how to keep the > percent-of-table-that-has-the-vm-bits-set statistic up to date. Maybe > that's a job for ANALYZE but I haven't thought about it much yet. > > 6. I'm sure there are probably some statements in the documentation > that need updating, but I haven't tracked 'em down yet. > > Comments, testing, review appreciated... > > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
On Fri, Aug 12, 2011 at 6:20 AM, Cédric Villemain <cedric.villemain.debian@gmail.com> wrote: >>> Can this faux heap tuple be appended by the data from another index >>> once it has been created ? ( if the query involves those 2 index) >> >> I don't see how to make that work. In general, a query like "SELECT >> a, b FROM foo WHERE a = 1 AND b = 1" can only use both indexes if we >> use a bitmap index scan on each followed by a bitmapand and then a >> bitmap heap scan. However, this patch only touches the index-scan >> path, which only knows how to use one index for any given query. > > I thought of something like that: 'select a,b from foo where a=1 > order by b limit 100' (or: where a=1 and b< now() ) Well... PostgreSQL can only use the index on a or the index on b, not both. This patch doesn't change that. I'm not trying to use indexes in some completely new way; I'm just trying to make them faster by optimizing away the heap access. >> Actually, I can see a possible way to allow an index-only type >> optimization to be used for bitmap scans. As you scan the index, any >> tuples that can be handled index-only get returned immediately; the >> rest are thrown into a bitmap. Once you're done examining the index, >> you then do a bitmap heap scan to get the tuples that couldn't be >> handled index-only. This seems like it might be our best hope for a >> "fast count(*)" type optimization, especially if you could combine it >> with some method of scanning the index in physical order rather than >> logical order. > > IIRC we expose some ideas around that, yes. (optimizing bitmap) > > Maybe a question that will explain me more about the feature > limitation (if any): > Does an index-only scan used when the table has no vismap set will > cost (in duration, IO, ...) more than a normal Index scan ? Yeah, it'll do a bit of extra work - the btree AM will cough up the tuple uselessly, and we'll check the visibility map, also uselessly. Then we'll end up doing it the regular way anyhow. I haven't measured that effect yet; hopefully it's fairly small. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
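One rough way to measure that extra work, once the patch is applied, is to time the same query before and after VACUUM has set the visibility map bits, since a freshly loaded table has none set (a sketch only; the table name and sizes are arbitrary, and whether the index path is chosen at all depends on the usual costing):

CREATE TABLE t AS SELECT g AS id FROM generate_series(1, 1000000) g;
CREATE INDEX ON t (id);
ANALYZE t;   -- statistics, but no visibility map bits yet
EXPLAIN (ANALYZE, BUFFERS) SELECT id FROM t WHERE id BETWEEN 1000 AND 2000;
VACUUM t;    -- sets the visibility map bits
EXPLAIN (ANALYZE, BUFFERS) SELECT id FROM t WHERE id BETWEEN 1000 AND 2000;

The first run pays the visibility-map check and still visits the heap for every tuple; the second can skip the heap fetches.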
2011/8/12 Robert Haas <robertmhaas@gmail.com>: > On Fri, Aug 12, 2011 at 6:20 AM, Cédric Villemain > <cedric.villemain.debian@gmail.com> wrote: >>>> Can this faux heap tuple be appended by the data from another index >>>> once it has been created ? ( if the query involves those 2 index) >>> >>> I don't see how to make that work. In general, a query like "SELECT >>> a, b FROM foo WHERE a = 1 AND b = 1" can only use both indexes if we >>> use a bitmap index scan on each followed by a bitmapand and then a >>> bitmap heap scan. However, this patch only touches the index-scan >>> path, which only knows how to use one index for any given query. >> >> I thought of something like that: 'select a,b from foo where a=1 >> order by b limit 100' (or: where a=1 and b< now() ) > > Well... PostgreSQL can only use the index on a or the index on b, not > both. This patch doesn't change that. I'm not trying to use indexes > in some completely new way; I'm just trying to make them faster by > optimizing away the heap access. For this kind of plan : Bitmap Heap Scan Recheck Cond BitmapAnd Bitmap Index Scan Bitmap Index Scan It may prevent useless Heap Fetch during "Bitmap Heap Scan", isn't it ? > >>> Actually, I can see a possible way to allow an index-only type >>> optimization to be used for bitmap scans. As you scan the index, any >>> tuples that can be handled index-only get returned immediately; the >>> rest are thrown into a bitmap. Once you're done examining the index, >>> you then do a bitmap heap scan to get the tuples that couldn't be >>> handled index-only. This seems like it might be our best hope for a >>> "fast count(*)" type optimization, especially if you could combine it >>> with some method of scanning the index in physical order rather than >>> logical order. >> >> IIRC we expose some ideas around that, yes. (optimizing bitmap) >> >> Maybe a question that will explain me more about the feature >> limitation (if any): >> Does an index-only scan used when the table has no vismap set will >> cost (in duration, IO, ...) more than a normal Index scan ? > > Yeah, it'll do a bit of extra work - the btree AM will cough up the > tuple uselessly, and we'll check the visibility map, also uselessly. > Then we'll end up doing it the regular way anyhow. I haven't measured > that effect yet; hopefully it's fairly small. If it is small, or if we can reduce it to be near absent. Then... why do we need to distinguish Index Scan at all ? (or Index*-Only* scan, which aren't 100% 'Only' btw) It is then just an internal optimisation on how we can access/use an index. (really good and promising one) -- Cédric Villemain +33 (0)6 20 30 22 52 http://2ndQuadrant.fr/ PostgreSQL: Support 24x7 - Développement, Expertise et Formation
On Fri, Aug 12, 2011 at 9:31 AM, Cédric Villemain <cedric.villemain.debian@gmail.com> wrote: >> Well... PostgreSQL can only use the index on a or the index on b, not >> both. This patch doesn't change that. I'm not trying to use indexes >> in some completely new way; I'm just trying to make them faster by >> optimizing away the heap access. > > For this kind of plan : > Bitmap Heap Scan > Recheck Cond > BitmapAnd > Bitmap Index Scan > Bitmap Index Scan > > It may prevent useless Heap Fetch during "Bitmap Heap Scan", isn't it ? The input to the bitmap heap scan is just a TID bitmap. It's too late to decide we want the index tuples at that point; we've forgotten where they are, and even if we remembered it, it wouldn't necessarily be any cheaper than checking the heap. We could optimize this to avoid a heap fetch in the case where we don't need any of the tuple data at all, but that's going to be somewhat rare, I think. >>>> Actually, I can see a possible way to allow an index-only type >>>> optimization to be used for bitmap scans. As you scan the index, any >>>> tuples that can be handled index-only get returned immediately; the >>>> rest are thrown into a bitmap. Once you're done examining the index, >>>> you then do a bitmap heap scan to get the tuples that couldn't be >>>> handled index-only. This seems like it might be our best hope for a >>>> "fast count(*)" type optimization, especially if you could combine it >>>> with some method of scanning the index in physical order rather than >>>> logical order. >>> >>> IIRC we expose some ideas around that, yes. (optimizing bitmap) >>> >>> Maybe a question that will explain me more about the feature >>> limitation (if any): >>> Does an index-only scan used when the table has no vismap set will >>> cost (in duration, IO, ...) more than a normal Index scan ? >> >> Yeah, it'll do a bit of extra work - the btree AM will cough up the >> tuple uselessly, and we'll check the visibility map, also uselessly. >> Then we'll end up doing it the regular way anyhow. I haven't measured >> that effect yet; hopefully it's fairly small. > > If it is small, or if we can reduce it to be near absent. > Then... why do we need to distinguish Index Scan at all ? (or > Index*-Only* scan, which aren't 100% 'Only' btw) > It is then just an internal optimisation on how we can access/use an > index. (really good and promising one) Right, that's kind of what I was going for. But the overhead isn't going to be exactly zero, so I think it makes sense to disable it in the cases where it clearly can't work. The question is really more whether we need to get more fine-grained than that, and actually do a cost-based comparison of an index-only scan vs. a regular index scan. I hope not, but I haven't tested it yet. One other thing to think about is that the choice to use an index-scan is not free of externalities. The index-only scan is hopefully faster than touching all the heap pages, but even if it weren't, not touching all of the heap pages potentially means avoiding evicting things from shared_buffers that some *other* query might need. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 11.08.2011 23:06, Robert Haas wrote:
> Comments, testing, review appreciated...

I would've expected this to use an index-only scan:

postgres=# CREATE TABLE foo AS SELECT generate_series(1,100000) AS id;
SELECT 100000
postgres=# CREATE INDEX i_foo ON foo (id) WHERE id = 10;
CREATE INDEX
postgres=# VACUUM ANALYZE foo;
VACUUM
postgres=# EXPLAIN SELECT id FROM foo WHERE id = 10;
                            QUERY PLAN
-----------------------------------------------------------------
 Index Scan using i_foo on foo  (cost=0.00..8.27 rows=1 width=4)
   Index Cond: (id = 10)
(2 rows)

If it's not a predicate index, then it works:

postgres=# DROP INDEX i_foo;
DROP INDEX
postgres=# EXPLAIN SELECT id FROM foo WHERE id = 10;
                               QUERY PLAN
-----------------------------------------------------------------------
 Index Only Scan using i_foo2 on foo  (cost=0.00..8.28 rows=1 width=4)
   Index Cond: (id = 10)
(2 rows)

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Fri, Aug 12, 2011 at 4:03 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > On 11.08.2011 23:06, Robert Haas wrote: >> Comments, testing, review appreciated... > > I would've expected this to use an index-only scan: > > postgres=# CREATE TABLE foo AS SELECT generate_series(1,100000) AS id; Ugh. I think there's something wrong with this test:

+    if (index->amcanreturn && !index->predOK && !index->indexprs)

I think that should probably say something like (ind->indpred == NIL || ind->predOK). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Aug 12, 2011, at 10:03 PM, Heikki Linnakangas wrote: > On 11.08.2011 23:06, Robert Haas wrote: >> Comments, testing, review appreciated... > > I would've expected this to use an index-only scan: > > postgres=# CREATE TABLE foo AS SELECT generate_series(1,100000) AS id; > SELECT 100000 > postgres=# CREATE INDEX i_foo ON foo (id) WHERE id = 10; > CREATE INDEX > postgres=# VACUUM ANALYZE foo; > VACUUM > postgres=# EXPLAIN SELECT id FROM foo WHERE id = 10; > QUERY PLAN > ----------------------------------------------------------------- > Index Scan using i_foo on foo (cost=0.00..8.27 rows=1 width=4) > Index Cond: (id = 10) > (2 rows) > > If it's not a predicate index, then it works: > > postgres=# DROP INDEX i_foo; > DROP INDEX > postgres=# EXPLAIN SELECT id FROM foo WHERE id = 10; > QUERY PLAN > ----------------------------------------------------------------------- > Index Only Scan using i_foo2 on foo (cost=0.00..8.28 rows=1 width=4) > Index Cond: (id = 10) > (2 rows) is there any plan to revise the cost for index only scans compared to what it is now? many thanks, hans -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt, Austria Web: http://www.postgresql-support.de
2011/8/12 PostgreSQL - Hans-Jürgen Schönig <postgres@cybertec.at>: > is there any plan to revise the cost for index only scans compared to what it is now? That's one of the points I asked for feedback on in my original email: "How should the costing be done?" -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> wrote: > That's one of the points I asked for feedback on in my original > email. "How should the costing be done?" It seems pretty clear that there should be some cost adjustment. If you can't get good numbers somehow on what fraction of the heap accesses will be needed, I would suggest using a magic number based on the assumption that half the heap access otherwise necessary will be done. It wouldn't be the worst magic number in the planner. Of course, real numbers are always better if you can get them. -Kevin
On Fri, Aug 12, 2011 at 5:39 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > Robert Haas <robertmhaas@gmail.com> wrote: > >> That's one of the points I asked for feedback on in my original >> email. "How should the costing be done?" > > It seems pretty clear that there should be some cost adjustment. If > you can't get good numbers somehow on what fraction of the heap > accesses will be needed, I would suggest using a magic number based > on the assumption that half the heap access otherwise necessary will > be done. It wouldn't be the worst magic number in the planner. Of > course, real numbers are always better if you can get them. It wouldn't be that difficult (I think) to make VACUUM and/or ANALYZE gather some statistics; what I'm worried about is that we'd have correlation problems. Consider a wide table with an index on (id, name), and the query: SELECT name FROM tab WHERE id = 12345 Now, suppose that we know that 50% of the heap pages have their visibility map bits set. What's the chance that this query won't need a heap fetch? Well, the chance is 50% *if* you assume that a row which has been quiescent for a long time is just as likely to be queried as one that has been recently inserted or updated. However, in many real-world use cases, nothing could be farther from the truth. What do we do about that? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
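To put a toy number on the risk (the figures here are invented): if the planner expects 1000 matching rows and ANALYZE says 90% of the table is all-visible, a naive rule would budget for roughly 1000 * (1 - 0.90) = 100 heap fetches. But if the query mostly touches recently modified rows - the common case for a "latest N rows" pattern - the true number could be close to 1000, so the naive estimate would be optimistic by nearly an order of magnitude.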
""" Now, suppose that we know that 50% of the heap pages have their visibility map bits set. What's the chance that this query won't need a heap fetch? Well, the chance is 50% *if* you assume that a row which has been quiescent for a long time is just as likely to be queried as one that has been recently inserted or updated. However, in many real-world use cases, nothing could be farther from the truth. What do we do about that? """ The example is much more realistic if the query is a fetch of N latest rows from a table. Very common use case, and the wholerelation's visibility statistics are completely wrong for that query. Wouldn't it be great if there was something likepg_stat_statements that would know the statistics per query, derived from usage... Even if the statistics are not available per query, the statistics could be calculated from the relation usage: the weightedvisibility of the pages would be pages_visible_when_read / total_pages_read for the relation. That percentage wouldminimize the average cost of the plans much better than just the non-weighted visibility percentage. For the above example, if the usage is 90% read the N latest rows and we assume they are never visible, the weighted visibilitypercentage would be 10% while the non-weighted visibility percentage could be 90%. Even if the visibility percentagewould be incorrect for the queries reading old rows, by definition of the weighted visibility percentage therewould not be too many of them. The same idea could of course be used to calculate the effective cache hit ratio for each table. Cache hit ratio would havethe problem of feedback loops, though. Of course, keeping such statistic could be more expensive than the benefit it gives. On the other hand, page hit percentageis already available... - Anssi
On 13.08.2011 23:35, Kääriäinen Anssi wrote:
> """
> Now, suppose that we know that 50% of the heap pages have their visibility map bits set. What's the chance that this query won't need a heap fetch? Well, the chance is 50% *if* you assume that a row which has been quiescent for a long time is just as likely to be queried as one that has been recently inserted or updated. However, in many real-world use cases, nothing could be farther from the truth.
>
> What do we do about that?
> """
>
> The example is much more realistic if the query is a fetch of N latest rows from a table. Very common use case, and the whole relation's visibility statistics are completely wrong for that query.

That is somewhat compensated by the fact that tuples that are accessed more often are also more likely to be in cache. Fetching the heap tuple to check visibility is very cheap when the tuple is in cache.

I'm not sure how far that compensates it, though. I'm sure there's typically nevertheless a fairly wide range of pages that have been modified since the last vacuum, but not in cache anymore.

> Wouldn't it be great if there was something like pg_stat_statements that would know the statistics per query, derived from usage...
>
> Even if the statistics are not available per query, the statistics could be calculated from the relation usage: the weighted visibility of the pages would be pages_visible_when_read / total_pages_read for the relation. That percentage would minimize the average cost of the plans much better than just the non-weighted visibility percentage.
>
> For the above example, if the usage is 90% read the N latest rows and we assume they are never visible, the weighted visibility percentage would be 10% while the non-weighted visibility percentage could be 90%. Even if the visibility percentage would be incorrect for the queries reading old rows, by definition of the weighted visibility percentage there would not be too many of them.
>
> The same idea could of course be used to calculate the effective cache hit ratio for each table. Cache hit ratio would have the problem of feedback loops, though.

Yeah, I'm not excited about making the planner and statistics more dynamic. Feedback loops and plan instability are not fun. I think we should rather be more aggressive in setting the visibility map bits.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Aug 13, 2011, at 4:31 PM, Heikki Linnakangas wrote: >> The example is much more realistic if the query is a fetch of N latest rows from a table. Very common use case, and the whole relation's visibility statistics are completely wrong for that query. > > That is somewhat compensated by the fact that tuples that are accessed more often are also more likely to be in cache. Fetching the heap tuple to check visibility is very cheap when the tuple is in cache. > > I'm not sure how far that compensates it, though. I'm sure there's typically nevertheless a fairly wide range of pages that have been modified since the last vacuum, but not in cache anymore. http://xkcd.org/937/ :) Could something be added to pg_stats that tracks visibility map usefulness on a per-attribute basis? Perhaps another set of stats buckets that show visibility percentages for each stats bucket? -- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On 08/11/2011 04:06 PM, Robert Haas wrote:
> On my laptop, the first query executes in about 555 ms, while the
> second one takes about 1125 ms...I expect that you could get
> an even larger benefit from this type of query if you could avoid
> actual disk I/O, rather than just buffer cache thrashing, but I
> haven't come up with a suitable test case for that yet (volunteers?).

That part I can help with, using a Linux test that kills almost every cache. I get somewhat faster times on my desktop here running the cached version like you were doing (albeit with debugging options on, so I wouldn't read too much into this set of numbers):

select sum(aid) from sample_data a1 where exists (select * from
  pgbench_accounts a where a.aid = a1.aid and a.aid != 1234567);
     sum
--------------
 250279412983
(1 row)
Time: 472.778 ms

                                                    QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=133325.00..133325.01 rows=1 width=4)
   ->  Nested Loop Semi Join  (cost=0.00..133075.00 rows=100000 width=4)
         ->  Seq Scan on sample_data a1  (cost=0.00..15286.00 rows=100000 width=4)
         ->  Index Only Scan using pgbench_accounts_pkey on pgbench_accounts a  (cost=0.00..1.17 rows=1 width=4)
               Index Cond: (aid = a1.aid)
               Filter: (aid <> 1234567)

select sum(aid) from sample_data a1 where exists (select * from
  pgbench_accounts a where a.aid = a1.aid and a.bid != 1234567);
     sum
--------------
 250279412983
Time: 677.902 ms

explain select sum(aid) from sample_data a1 where exists (select * from
  pgbench_accounts a where a.aid = a1.aid and a.bid != 1234567);
                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=133325.00..133325.01 rows=1 width=4)
   ->  Nested Loop Semi Join  (cost=0.00..133075.00 rows=100000 width=4)
         ->  Seq Scan on sample_data a1  (cost=0.00..15286.00 rows=100000 width=4)
         ->  Index Scan using pgbench_accounts_pkey on pgbench_accounts a  (cost=0.00..1.17 rows=1 width=4)
               Index Cond: (aid = a1.aid)
               Filter: (bid <> 1234567)

If I setup my gsmith account to be able to start and stop the server with pg_ctl and a valid PGDATA, and drop these two scripts in that home directory:

== index-only-1.sql ==
\timing
select sum(aid) from sample_data a1 where exists (select * from
  pgbench_accounts a where a.aid = a1.aid and a.aid != 1234567);
explain select sum(aid) from sample_data a1 where exists (select * from
  pgbench_accounts a where a.aid = a1.aid and a.aid != 1234567);

== index-only-2.sql ==
\timing
select sum(aid) from sample_data a1 where exists (select * from
  pgbench_accounts a where a.aid = a1.aid and a.bid != 1234567);
explain select sum(aid) from sample_data a1 where exists (select * from
  pgbench_accounts a where a.aid = a1.aid and a.bid != 1234567);

I can then run this script as root:

#!/bin/bash
ME="gsmith"

su - $ME -c "pg_ctl stop -w"
echo 3 > /proc/sys/vm/drop_caches
su - $ME -c "pg_ctl start -w"
su - $ME -c "psql -ef index-only-1.sql"

su - $ME -c "pg_ctl stop -w"
echo 3 > /proc/sys/vm/drop_caches
su - $ME -c "pg_ctl start -w"
su - $ME -c "psql -ef index-only-2.sql"

And get results that start with zero information cached in RAM, showing a much more dramatic difference.
Including some snippets of interesting vmstat too, the index-only one gets faster as it runs while the regular one is pretty flat:

select sum(aid) from sample_data a1 where exists (select * from
  pgbench_accounts a where a.aid = a1.aid and a.aid != 1234567);
Time: 31677.683 ms

$ vmstat 5
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 0  1      0 15807288   4388 126440    0    0  4681   118 1407 2432  1  1 89 10
 1  1      0 15779388   4396 154448    0    0  3587    17 1135 2058  1  0 86 13
 0  1      0 15739956   4396 193672    0    0  5800     0 1195 2056  1  0 87 12
 0  1      0 15691844   4396 241832    0    0  7053     3 1299 2044  1  0 86 13
 0  1      0 15629736   4396 304096    0    0  7995    37 1391 2053  1  0 87 12
 0  1      0 15519244   4400 414268    0    0 11639    14 1448 2189  1  0 87 12

select sum(aid) from sample_data a1 where exists (select * from
  pgbench_accounts a where a.aid = a1.aid and a.bid != 1234567);
Time: 172381.235 ms

$ vmstat 5
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 0  1      0 15736500   4848 196444    0    0  3142    22 1092 1989  1  0 86 13
 0  1      0 15711948   4852 221072    0    0  3411     1 1039 1943  0  0 88 12
 0  1      0 15685412   4852 247496    0    0  3618    34 1111 1997  0  0 86 13
[this is the middle part, rate doesn't vary too much]

That's 5.4X as fast; not too shabby! Kind of interesting how much different the I/O pattern is on the index-only version.

I ran this test against a 3-disk RAID0 set with a 256MB BBWC, so there's some possibility of caching here. But given that each query blows away a large chunk of the other's data, I wouldn't expect that to be a huge factor here:

gsmith=# select pg_size_pretty(pg_relation_size('pgbench_accounts'));
 pg_size_pretty
----------------
 640 MB

gsmith=# select pg_size_pretty(pg_relation_size('pgbench_accounts_pkey'));
 pg_size_pretty
----------------
 107 MB

gsmith=# select pg_size_pretty(pg_relation_size('sample_data'));
 pg_size_pretty
----------------
 112 MB

And with the large difference in response time, things appear to be working as hoped even in this situation. If you try this on your laptop, where drive cache size and random I/O are likely to be even slower, you might see an ever larger difference.

-- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
On Mon, Aug 15, 2011 at 7:37 PM, Greg Smith <greg@2ndquadrant.com> wrote: > That's 5.4X as fast; not too shabby! Awesome! > And with the large difference in response time, things appear to be working > as hoped even in this situation. If you try this on your laptop, where > drive cache size and random I/O are likely to be even slower, you might see > an ever larger difference. Hah! Just in case a 5.4X performance improvement isn't good enough? :-) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 08/14/2011 12:31 AM, Heikki Linnakangas wrote:
>> The same idea could of course be used to calculate the effective cache hit ratio for each table. Cache hit ratio would have the problem of feedback loops, though.
> Yeah, I'm not excited about making the planner and statistics more
> dynamic. Feedback loops and plan instability are not fun.

I might be a little out of my league here... But I was thinking about the cache hit ratio and feedback loops. I understand automatic tuning would be hard. But making automatic tuning easier (by using pg_tune for example) would be a big plus for most use cases.

To make it easier to tune the page read costs automatically, it would be nice if there were four variables instead of the current two:
- random_page_cost is the cost of reading a random page from storage. Currently it is not; it is the cost of accessing a random page, taking into account that it might be in memory.
- seq_page_cost is the cost of reading pages sequentially from storage
- memory_page_cost is the cost of reading a page in memory
- cache_hit_ratio is the expected cache hit ratio

memory_page_cost would be server global, random and seq page costs tablespace specific, and cache_hit_ratio relation specific. You would get the current behavior by tuning *_page_costs realistically, and setting cache_hit_ratio globally so that the expected random_page_cost / seq_page_cost stays the same as now.

The biggest advantage of this would be that the correct values are much easier to detect automatically compared to the current situation. This can be done using pg_statio_* views and IO speed testing. They should not be tuned automatically by PostgreSQL, at least not the cache_hit_ratio, as that leads to the possibility of feedback loops and plan instability. The variables would also be much easier to understand.

There is the question of whether one should be allowed to tune the *_page_costs at all. If I am not missing something, it is possible to detect the correct values programmatically, and they do not change if you do not change the hardware. Cache hit ratio is the real reason why they are currently so important for tuning.

An example of why the current random_page_cost and seq_page_cost tuning is not adequate is that you can only set random_page_cost per tablespace. That makes perfect sense if random_page_cost were the cost of accessing a page in storage. But it is not; it is a combination of that and caching effects, so that it actually varies per relation (and over time). How do you set it correctly for a query where one relation is fully cached and another one not?

Another problem is that if you use random_page_cost == seq_page_cost, you are effectively saying that everything is in cache. But if everything is in cache, the cost of page access relative to cpu_*_costs is way off. The more random_page_cost and seq_page_cost differ, the more they mean the storage access costs. When they are the same, they mean the memory page cost. There can be an order of magnitude of difference between a storage page cost and a memory page cost. So it is hard to tune the cpu_*_costs realistically for cases where sometimes data is in cache and sometimes not.

Ok, enough hand waving for one post :) Sorry if this all is obvious / discussed before.
My googling didn't turn up anything directly related, although these have some similarity:
- Per-table random_page_cost for tables that we know are always cached
  [http://archives.postgresql.org/pgsql-hackers/2008-04/msg01503.php]
- Script to compute random page cost
  [http://archives.postgresql.org/pgsql-hackers/2002-09/msg00503.php]
- The science of optimization in practical terms?
  [http://archives.postgresql.org/pgsql-hackers/2009-02/msg00718.php], getting really interesting starting from here:
  [http://archives.postgresql.org/pgsql-hackers/2009-02/msg00787.php]

- Anssi
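For comparison with what exists today (the tablespace name below is illustrative): seq_page_cost and random_page_cost can already be overridden per tablespace, but there is no per-relation setting, and memory_page_cost and cache_hit_ratio exist only in the proposal above:

-- current capability: per-tablespace storage cost overrides
ALTER TABLESPACE fast_ssd SET (random_page_cost = 1.1, seq_page_cost = 1.0);
-- hypothetical, per the proposal; this does not exist:
-- ALTER TABLE pgbench_accounts SET (cache_hit_ratio = 0.9);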
2011/8/16 Anssi Kääriäinen <anssi.kaariainen@thl.fi>: > On 08/14/2011 12:31 AM, Heikki Linnakangas wrote: >>> >>> The same idea could of course be used to calculate the effective cache >>> hit ratio for each table. Cache hit ratio would have the problem of feedback >>> loops, though. >> >> Yeah, I'm not excited about making the planner and statistics more >> dynamic. Feedback loops and plan instability are not fun. > > I might be a little out of my league here... But I was thinking about the > cache hit ratio and feedback loops. I understand automatic tuning would be > hard. But making automatic tuning easier (by using pg_tune for example) > would be a big plus for most use cases. > > To make it easier to tune the page read costs automatically, it would be > nice if there would be four variables instead of the current two: > - random_page_cost is the cost of reading a random page from storage. > Currently it is not, it is the cost of accessing a random page, taking in > account it might be in memory. > - seq_page_cost is the cost of reading pages sequentially from storage > - memory_page_cost is the cost of reading a page in memory > - cache_hit_ratio is the expected cache hit ratio > > memory_page_cost would be server global, random and seq page costs > tablespace specific, and cache_hit_ratio relation specific. You would get > the current behavior by tuning *_page_costs realistically, and setting > cache_hit_ratio globally so that the expected random_page_cost / > seq_page_cost stays the same as now. > > The biggest advantage of this would be that the correct values are much > easier to detect automatically compared to current situation. This can be > done using pg_statio_* views and IO speed testing. They should not be tuned > automatically by PostgreSQL, at least not the cache_hit_ratio, as that leads > to the possibility of feedback loops and plan instability. The variables > would also be much easier to understand. > > There is the question if one should be allowed to tune the *_page_costs at > all. If I am not missing something, it is possible to detect the correct > values programmatically and they do not change if you do not change the > hardware. Cache hit ratio is the real reason why they are currently so > important for tuning. > > An example why the current random_page_cost and seq_page_cost tuning is not > adequate is that you can only set random_page_cost per tablespace. That > makes perfect sense if random_page_cost would be the cost of accessing a > page in storage. But it is not, it is a combination of that and caching > effects, so that it actually varies per relation (and over time). How do you > set it correctly for a query where one relation is fully cached and another > one not? > > Another problem is that if you use random_page_cost == seq_page_cost, you > are effectively saying that everything is in cache. But if everything is in > cache, the cost of page access relative to cpu_*_costs is way off. The more > random_page_cost and seq_page_cost are different, the more they mean the > storage access costs. When they are the same, they mean the memory page > cost. There can be an order of magnitude in difference of a storage page > cost and a memory page cost. So it is hard to tune the cpu_*_costs > realistically for cases where sometimes data is in cache and sometimes not. > > Ok, enough hand waving for one post :) Sorry if this all is obvious / > discussed before. 
My googling didn't turn out anything directly related, > although these have some similarity: > - Per-table random_page_cost for tables that we know are always cached > [http://archives.postgresql.org/pgsql-hackers/2008-04/msg01503.php] > - Script to compute random page cost > [http://archives.postgresql.org/pgsql-hackers/2002-09/msg00503.php] > - The science of optimization in practical terms? > [http://archives.postgresql.org/pgsql-hackers/2009-02/msg00718.php], getting > really interesting starting from here: > [http://archives.postgresql.org/pgsql-hackers/2009-02/msg00787.php] late reply. You can add this link to your list: http://archives.postgresql.org/pgsql-hackers/2011-06/msg01140.php -- Cédric Villemain +33 (0)6 20 30 22 52 http://2ndQuadrant.fr/ PostgreSQL: Support 24x7 - Développement, Expertise et Formation
On Tue, Aug 16, 2011 at 11:24 AM, Anssi Kääriäinen <anssi.kaariainen@thl.fi> wrote: > There is the question of whether one should be allowed to tune the *_page_costs at > all. If I am not missing something, it is possible to detect the correct > values programmatically and they do not change if you do not change the > hardware. Cache hit ratio is the real reason why they are currently so > important for tuning. Unfortunately things are a tad more complex than this picture. There are multiple levels of cache involved here. There's the Postgres buffer cache, the filesystem buffer cache, and then the raid controller or drives often have cache as well. Also the difference between seq_page_cost and random_page_cost is hiding another cache effect. The reason sequential reads are faster is twofold: there's no seek, but also there's an increased chance the buffer is already in the filesystem cache due to having been prefetched. Actually it's hardly even probabilistic -- only every nth page needs to do i/o when doing sequential reads. -- greg
On Sun, Aug 14, 2011 at 00:31, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > That is somewhat compensated by the fact that tuples that are accessed more > often are also more likely to be in cache. Fetching the heap tuple to check > visibility is very cheap when the tuple is in cache. > > I'm not sure how far that compensates it, though. I'm sure there's typically > nevertheless a fairly wide range of pages that have been modified since the > last vacuum, but not in cache anymore. Would it make sense to re-evaluate the visibility bit just before a page gets flushed out from shared buffers? On a system with no long transactions, it seems likely that a dirty page is already all-visible by the time bgwriter (or shared buffers memory pressure) gets around to writing it out. That way we don't have to wait for vacuum to do it and would make your observation hold more often. Regards, Marti
On Sun, Sep 25, 2011 at 2:43 PM, Marti Raudsepp <marti@juffo.org> wrote: > On Sun, Aug 14, 2011 at 00:31, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> That is somewhat compensated by the fact that tuples that are accessed more >> often are also more likely to be in cache. Fetching the heap tuple to check >> visibility is very cheap when the tuple is in cache. >> >> I'm not sure how far that compensates it, though. I'm sure there's typically >> nevertheless a fairly wide range of pages that have been modified since the >> last vacuum, but not in cache anymore. > > Would it make sense to re-evaluate the visibility bit just before a > page gets flushed out from shared buffers? On a system with no long > transactions, it seems likely that a dirty page is already all-visible > by the time bgwriter (or shared buffers memory pressure) gets around > to writing it out. That way we don't have to wait for vacuum to do it > and would make your observation hold more often. This has been suggested before, and, sure, there might be cases where it helps. But you need to choose your test case fairly carefully. For example, if you're doing a large sequential scan on a table, the ring-buffer logic causes processes to evict their own pages, and so the background writer doesn't get a chance to touch any of those pages. You need some kind of a workload where pages are being evicted from shared buffers slowly enough that it ends up being the background writer, rather than the individual backends, that do the work. But if you have that kind of workload, then we can infer that most of your working set fits into shared buffers. And in that case you don't really need index-only scans in the first place. The main point of index only scans is to optimize the case where you have a gigantic table and you're trying to avoid swamping the system with random I/O. I'm not saying that such a change would be a particularly bad idea, but I'm not personally planning to work on it any time soon because I can't convince myself that it really helps all that much. I think the real solution to getting visibility map bits set is to vacuum more frequently, but have it be cheaper each time. Our default autovacuum settings vacuum the table when the number of updates and deletes reaches 20% of the table size. But those settings were put in place under the assumption that we'll have to scan the entire heap, dirtying every page that contains dead tuples, scan all the indexes and remove the associated index pointers, and then scan and dirty the heap pages that contain now-dead line pointers a second time to remove those. The visibility map has eroded those assumptions to some degree, because now we probably won't have to scan the entire heap every time we vacuum; and I'm hoping we're going to see some further erosion. Pavan has a pending patch which, if we can work out the details, will eliminate the second heap scan; and we've also talked about doing the index scan only when there are enough dead line pointers to justify the effort. That, it seems to me, would open the door to lowering the scale factor, maybe by quite a bit - which, in turn, would help us control bloat better and get visibility map bits set sooner. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
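As a rough sketch of the per-table tuning Robert alludes to, using the storage-parameter syntax that already exists (the table name and value are only examples):

    ALTER TABLE pgbench_accounts SET (autovacuum_vacuum_scale_factor = 0.05);
    -- the default scale factor is 0.2, i.e. vacuum once updates+deletes reach
    -- 20% of the table; a smaller value vacuums sooner but, with the
    -- improvements discussed above, each vacuum should also be cheaper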
Robert Haas <robertmhaas@gmail.com> writes: > Please find attached a patch implementing a basic version of > index-only scans. This patch is the work of my colleague Ibrar Ahmed > and myself, and also incorporates some code from previous patches > posted by Heikki Linnakanagas. I'm starting to review this patch now. Has any further work been done since the first version was posted? Also, was any documentation written? I'm a tad annoyed by having to reverse-engineer the changes in the AM API specification from the code. regards, tom lane
On Thu, Oct 6, 2011 at 3:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Please find attached a patch implementing a basic version of >> index-only scans. This patch is the work of my colleague Ibrar Ahmed >> and myself, and also incorporates some code from previous patches >> posted by Heikki Linnakanagas. > > I'm starting to review this patch now. Thanks! > Has any further work been done > since the first version was posted? Also, was any documentation > written? I'm a tad annoyed by having to reverse-engineer the changes > in the AM API specification from the code. Not really. We have detected a small performance regression when both heap and index fit in shared_buffers and an index-only scan is used. I have a couple of ideas for improving this. One is to store a virtual tuple into the slot instead of building a regular tuple, but what do we do about tuples with OIDs? Another is to avoid locking the index buffer multiple times - right now it locks the index buffer to get the TID, and then relocks it to extract the index tuple (after checking that nothing disturbing has happened meanwhile). It seems likely that with some refactoring we could get this down to a single lock/unlock cycle, but I haven't figured out exactly where the TID gets copied out. With regard to the AM API, the basic idea is we're just adding a Boolean to say whether the AM is capable of returning index tuples. If it's not, then we don't ever try an index-only scan. If it is, then we'll set the want_index_tuple flag if an index-only scan is possible. This requests that the AM attempt to return the tuple; but the AM is also allowed to fail and not return the tuple whenever it wants. This is more or less the interface that Heikki suggested a couple years back, but it might well be vulnerable to improvement. Incidentally, if you happen to feel the urge to beat this around and send it back rather than posting a list of requested changes, feel free. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > Not really. We have detected a small performance regression when both > heap and index fit in shared_buffers and an index-only scan is used. > I have a couple of ideas for improving this. One is to store a > virtual tuple into the slot instead of building a regular tuple, but > what do we do about tuples with OIDs? Yeah, I was just looking at that. I think it's going to be a clear win to use a virtual tuple slot instead of constructing and deconstructing a HeapTuple. The obvious solution is to decree that you can't use an index-only scan if the query requires fetching OID (or any other system column). This would be slightly annoying for catalog fetches but I'm not sure I believe that catalog queries are that important a use-case. I was also toying with the notion of pushing the slot fill-in into the AM, so that the AM API is to return a filled TupleSlot not an IndexTuple. This would not save any cycles AFAICT --- at least in btree, we still have to make a palloc'd copy of the IndexTuple so that we can release lock on the index page. The point of it would be to avoid the assumption that the index's internal storage has exactly the format of IndexTuple. Don't know whether we'd ever have any actual use for that flexibility, but it seems like it wouldn't cost much to preserve the option. > Another is to avoid locking the > index buffer multiple times - right now it locks the index buffer to > get the TID, and then relocks it to extract the index tuple (after > checking that nothing disturbing has happened meanwhile). It seems > likely that with some refactoring we could get this down to a single > lock/unlock cycle, but I haven't figured out exactly where the TID > gets copied out. Yeah, maybe, but let's get the patch committed before we start looking for second-order optimizations. > With regard to the AM API, the basic idea is we're just adding a > Boolean to say whether the AM is capable of returning index tuples. > If it's not, then we don't ever try an index-only scan. If it is, > then we'll set the want_index_tuple flag if an index-only scan is > possible. This requests that the AM attempt to return the tuple; but > the AM is also allowed to fail and not return the tuple whenever it > wants. It was the "allowed to fail" bit that I was annoyed to discover only by reading some well-buried code. > Incidentally, if you happen to feel the urge to beat this around and > send it back rather than posting a list of requested changes, feel > free. You can count on that ;-). I've already found a lot of things I didn't care for. regards, tom lane
On 6 October 2011 21:11, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Not really. We have detected a small performance regression when both >> heap and index fit in shared_buffers and an index-only scan is used. >> I have a couple of ideas for improving this. One is to store a >> virtual tuple into the slot instead of building a regular tuple, but >> what do we do about tuples with OIDs? > > Yeah, I was just looking at that. I think it's going to be a clear win > to use a virtual tuple slot instead of constructing and deconstructing > a HeapTuple. The obvious solution is to decree that you can't use an > index-only scan if the query requires fetching OID (or any other system > column). This would be slightly annoying for catalog fetches but I'm > not sure I believe that catalog queries are that important a use-case. > > I was also toying with the notion of pushing the slot fill-in into the > AM, so that the AM API is to return a filled TupleSlot not an > IndexTuple. This would not save any cycles AFAICT --- at least in > btree, we still have to make a palloc'd copy of the IndexTuple so that > we can release lock on the index page. The point of it would be to > avoid the assumption that the index's internal storage has exactly the > format of IndexTuple. Don't know whether we'd ever have any actual use > for that flexibility, but it seems like it wouldn't cost much to > preserve the option. > >> Another is to avoid locking the >> index buffer multiple times - right now it locks the index buffer to >> get the TID, and then relocks it to extract the index tuple (after >> checking that nothing disturbing has happened meanwhile). It seems >> likely that with some refactoring we could get this down to a single >> lock/unlock cycle, but I haven't figured out exactly where the TID >> gets copied out. > > Yeah, maybe, but let's get the patch committed before we start looking > for second-order optimizations. +1 Been testing this today with a few regression tests and haven't seen anything noticeably broken. Does what it says on the tin. -- Thom Brown Twitter: @darkixion IRC (freenode): dark_ixion Registered Linux user: #516935 EnterpriseDB UK: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > Please find attached a patch implementing a basic version of > index-only scans. I'm making some progress with this, but I notice what seems like a missing feature: there needs to be a way to turn it off. Otherwise performance comparisons will be difficult to impossible. The most obvious solution is a planner control GUC, perhaps "enable_indexonlyscan". Anyone object, or want to bikeshed the name? regards, tom lane
On 10/07/2011 11:40 AM, Tom Lane wrote: > Robert Haas<robertmhaas@gmail.com> writes: >> Please find attached a patch implementing a basic version of >> index-only scans. > > I'm making some progress with this, but I notice what seems like a > missing feature: there needs to be a way to turn it off. Otherwise > performance comparisons will be difficult to impossible. > > The most obvious solution is a planner control GUC, perhaps > "enable_indexonlyscan". Anyone object, or want to bikeshed the name? enable_onlyindexscan I'm kidding. +1 on Tom's proposed name. > > regards, tom lane > -- Command Prompt, Inc. - http://www.commandprompt.com/ PostgreSQL Support, Training, Professional Services and Development The PostgreSQL Conference - http://www.postgresqlconference.org/ @cmdpromptinc - @postgresconf - 509-416-6579
On Fri, Oct 7, 2011 at 2:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Please find attached a patch implementing a basic version of >> index-only scans. > > I'm making some progress with this, but I notice what seems like a > missing feature: there needs to be a way to turn it off. Otherwise > performance comparisons will be difficult to impossible. > > The most obvious solution is a planner control GUC, perhaps > "enable_indexonlyscan". Anyone object, or want to bikeshed the name? I was expecting you to NOT want that, or I would have put it in to begin with... so go for it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Oct 7, 2011 at 2:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I'm making some progress with this, but I notice what seems like a >> missing feature: there needs to be a way to turn it off. Otherwise >> performance comparisons will be difficult to impossible. >> >> The most obvious solution is a planner control GUC, perhaps >> "enable_indexonlyscan". Anyone object, or want to bikeshed the name? > I was expecting you to NOT want that, or I would have put it in to > begin with... so go for it. It seems unlikely to have any use except for testing, but I think we need it for that. regards, tom lane
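A minimal sketch of how the GUC would be used for such comparisons, assuming it goes in under the proposed name:

    SET enable_indexonlyscan = off;
    EXPLAIN ANALYZE SELECT sum(aid) FROM pgbench_accounts WHERE aid < 100000;
    SET enable_indexonlyscan = on;
    EXPLAIN ANALYZE SELECT sum(aid) FROM pgbench_accounts WHERE aid < 100000;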
Robert Haas <robertmhaas@gmail.com> writes: > Please find attached a patch implementing a basic version of > index-only scans. This patch is the work of my colleague Ibrar Ahmed > and myself, and also incorporates some code from previous patches > posted by Heikki Linnakanagas. I've committed this after some rather substantial editorialization. There's still a lot left to do of course, but it seems to need performance testing next, and that'll be easier if the code is in HEAD. > 1. The way that nodeIndexscan.c builds up the faux heap tuple is > perhaps susceptible to improvement. I thought about building a > virtual tuple, but then what do I do with an OID column, if I have > one? Or maybe this should be done some other way altogether. I switched it to use a virtual tuple for now, and just not attempt to use index-only scans if a system column is required. We're likely to want to rethink this anyway, because as currently constituted the code can't do anything with an expression index, and avoiding recalculation of an expensive function could be a nice win. But the approach of just building a faux heap tuple fundamentally doesn't work for that. > 2. Suppose we scan one tuple on a not-all-visible page followed by 99 > tuples on all-visible pages. The code as written will hold the pin on > the first heap page for the entire scan. As soon as we hit the end of > the scan or another tuple where we have to actually visit the page, > the old pin will be released, but until then we hold onto it. I did not do anything about this issue --- ISTM it needs performance testing. > 3. The code in create_index_path() builds up a bitmapset of heap > attributes that get used for any purpose anywhere in the query, and > hangs it on the RelOptInfo so it doesn't need to be rebuilt for every > index under consideration. However, if it were somehow possible to > have the rel involved without using any attributes at all, we'd > rebuild the cache over and over, since it would never become non-NULL. I dealt with this by the expedient of getting rid of the caching ;-). It's not clear to me that it was worth the trouble, and in any case it's fundamentally wrong to suppose that every index faces the same set of attributes it must supply. It should not need to supply columns that are only needed in indexquals or index predicate conditions. I'm not sure how to deal with those refinements cheaply enough, but the cache isn't helping. > 4. There are a couple of cases that use index-only scans even though > the EXPLAIN output sort of makes it look like they shouldn't. For > example, in the above queries, an index-only scan is chosen even > though the query does "SELECT *" from the table being scanned. Even > though the EXPLAIN (VERBOSE) output makes it look otherwise, it seems > that the target list of an EXISTS query is in fact discarded, e.g.: The reason it looks that way is that we're choosing to use a "physical result tuple" to avoid an ExecProject step at runtime. There's nothing wrong with the logic, it's just that EXPLAIN shows something that might mislead people. > 5. We haven't made any planner changes at all, not even for cost > estimation. It is not clear to me what the right way to do cost > estimation here is. Yeah, me either. For the moment I put in a hard-wired estimate that only 90% of the heap pages would actually get fetched. This is conservative and only meant to ensure that the planner picks an index-only-capable plan over an indexscan with a non-covering index, all else being equal. 
We'll need to do performance testing before we can refine that. regards, tom lane
On Fri, Oct 7, 2011 at 8:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> 1. The way that nodeIndexscan.c builds up the faux heap tuple is >> perhaps susceptible to improvement. I thought about building a >> virtual tuple, but then what do I do with an OID column, if I have >> one? Or maybe this should be done some other way altogether. > > I switched it to use a virtual tuple for now, and just not attempt to > use index-only scans if a system column is required. We're likely to > want to rethink this anyway, because as currently constituted the code > can't do anything with an expression index, and avoiding recalculation > of an expensive function could be a nice win. But the approach of > just building a faux heap tuple fundamentally doesn't work for that. Figuring out how to fix that problem likely requires more knowledge of the executor than I have got. >> 2. Suppose we scan one tuple on a not-all-visible page followed by 99 >> tuples on all-visible pages. The code as written will hold the pin on >> the first heap page for the entire scan. As soon as we hit the end of >> the scan or another tuple where we have to actually visit the page, >> the old pin will be released, but until then we hold onto it. > > I did not do anything about this issue --- ISTM it needs performance > testing. I'm actually less worried about any performance problem than I am about the possibility of holding up VACUUM. That can happen the old way, too, but now the pin could stay on the same page for quite a while even when the scan is advancing. I think we maybe ought to think seriously about solving the problem at the other end, though - either make VACUUM skip pages that it can't get a cleanup lock on without blocking (except in anti-wraparound mode) or have it just do the amount of work that can be done with an exclusive lock (i.e. prune but not defragment, which would work even in anti-wraparound mode). That would solve the problems of (1) undetected VACUUM deadlock vs. a buffer pin, (2) indefinite VACUUM stall due to a suspended query, and (3) this issue. >> 3. The code in create_index_path() builds up a bitmapset of heap >> attributes that get used for any purpose anywhere in the query, and >> hangs it on the RelOptInfo so it doesn't need to be rebuilt for every >> index under consideration. However, if it were somehow possible to >> have the rel involved without using any attributes at all, we'd >> rebuild the cache over and over, since it would never become non-NULL. > > I dealt with this by the expedient of getting rid of the caching ;-). > It's not clear to me that it was worth the trouble, and in any case > it's fundamentally wrong to suppose that every index faces the same > set of attributes it must supply. It should not need to supply columns > that are only needed in indexquals or index predicate conditions. > I'm not sure how to deal with those refinements cheaply enough, but > the cache isn't helping. Oh, hmm. >> 4. There are a couple of cases that use index-only scans even though >> the EXPLAIN output sort of makes it look like they shouldn't. For >> example, in the above queries, an index-only scan is chosen even >> though the query does "SELECT *" from the table being scanned. Even >> though the EXPLAIN (VERBOSE) output makes it look otherwise, it seems >> that the target list of an EXISTS query is in fact discarded, e.g.: > > The reason it looks that way is that we're choosing to use a "physical > result tuple" to avoid an ExecProject step at runtime. 
There's nothing > wrong with the logic, it's just that EXPLAIN shows something that might > mislead people. I wonder if we oughta do something about that. I was also thinking we should probably make EXPLAIN ANALYZE display the number of heap fetches, so that you can see how index-only your index-only scan actually was. >> 5. We haven't made any planner changes at all, not even for cost >> estimation. It is not clear to me what the right way to do cost >> estimation here is. > > Yeah, me either. For the moment I put in a hard-wired estimate that > only 90% of the heap pages would actually get fetched. This is > conservative and only meant to ensure that the planner picks an > index-only-capable plan over an indexscan with a non-covering index, > all else being equal. We'll need to do performance testing before > we can refine that. Yep. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
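Purely as an illustration of the instrumentation Robert suggests (the label and numbers are invented here, not output from the patch), EXPLAIN ANALYZE might report something like:

    Index Only Scan using pgbench_accounts_pkey on pgbench_accounts
      Index Cond: (aid < 100000)
      Heap Fetches: 137

where a low count relative to the rows returned would mean the visibility map answered most of the visibility checks.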
I wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Not really. We have detected a small performance regression when both >> heap and index fit in shared_buffers and an index-only scan is used. >> I have a couple of ideas for improving this. One is to store a >> virtual tuple into the slot instead of building a regular tuple, but >> what do we do about tuples with OIDs? [ that's done ] > I was also toying with the notion of pushing the slot fill-in into the > AM, so that the AM API is to return a filled TupleSlot not an > IndexTuple. This would not save any cycles AFAICT --- at least in > btree, we still have to make a palloc'd copy of the IndexTuple so that > we can release lock on the index page. The point of it would be to > avoid the assumption that the index's internal storage has exactly the > format of IndexTuple. Don't know whether we'd ever have any actual use > for that flexibility, but it seems like it wouldn't cost much to > preserve the option. BTW, I concluded that that would be a bad idea, because it would involve the index AM in some choices that we're likely to want to change. In particular it would foreclose ever doing anything with expression indexes, without an AM API change. Better to just define the AM's responsibility as to hand back a tuple defined according to the index's columns. >> Another is to avoid locking the >> index buffer multiple times - right now it locks the index buffer to >> get the TID, and then relocks it to extract the index tuple (after >> checking that nothing disturbing has happened meanwhile). It seems >> likely that with some refactoring we could get this down to a single >> lock/unlock cycle, but I haven't figured out exactly where the TID >> gets copied out. > Yeah, maybe, but let's get the patch committed before we start looking > for second-order optimizations. On reflection I'm starting to think that the above would be a good idea because there are a couple of bogosities in the basic choices this patch made. In particular, I'm thinking about how we could use an index on f(x) to avoid recalculating f() in something like select f(x) from tab where f(x) < 42; assuming that f() is expensive but immutable. The planner side of this is already a bit daunting, because it's not clear how to recognize that an index on f(x) is a covering index (the existing code is going to think that x itself needs to be available). But the executor side is a real problem, because it will fail to make use of the f() value fetched from the index anytime the heap visibility test fails. I believe that we should rejigger things so that when an index-only scan is selected, the executor *always* works from the data supplied by the index. Even if it has to visit the heap --- it will do that but just to consult the tuple's visibility data, and then use what it got from the index anyway. This means we'd build the plan node's filter quals and targetlist to reference the index tuple columns not the underlying table's. (Which in passing gets rid of the behavior you were complaining about that EXPLAIN VERBOSE shows a lot of columns that aren't actually being computed.) In order to make this work, we have to remove the API wart that says the index AM is allowed to choose to not return the index tuple. And that ties into what you were saying above. What we ought to do is have _bt_readpage() save off copies of the whole tuples not only the TIDs when it is extracting data from the page. 
This is no more net copying work than what happens now (assuming that all the tuples get fetched) since we won't need the per-tuple memcpy that occurs now in bt_getindextuple. The tuples can go into a page-sized workspace buffer associated with the BTScanPosData structure, and then just be referenced in-place in that workspace with no second copy step. I'm inclined to do the last part immediately, since there's a performance argument for it, and then work on revising the executor implementation. regards, tom lane
On Oct 8, 2011, at 11:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I'm inclined to do the last part immediately, since there's a > performance argument for it, and then work on revising the executor > implementation. Sounds great. Thanks for working on this. ...Robert
I wrote: > I believe that we should rejigger things so that when an index-only scan > is selected, the executor *always* works from the data supplied by the > index. Even if it has to visit the heap --- it will do that but just to > consult the tuple's visibility data, and then use what it got from the > index anyway. This means we'd build the plan node's filter quals and > targetlist to reference the index tuple columns not the underlying > table's. I've been studying this a bit. The key decision is how to represent Vars that reference columns of the index. We really have to have varattno equal to the index column number, else ExecEvalVar will pull the wrong column from the tuple. However, varno is not so clear cut. There are at least four things we could do: 1. Keep varno = table's rangetable index. The trouble with this is that a Var referencing index column N would look exactly like a Var referencing table column N; so the same Var would mean something different in an index-only scan node than it does in any other type of scan node for the same table. We could maybe make that work, but it seems confusing and fragile as heck. The executor isn't going to care much, but inspection of the plan tree by e.g. EXPLAIN sure will. 2. Set varno = OUTER (or maybe INNER). This is safe because there's no other use for OUTER/INNER in a table scan node. We would have to hack things so that the index tuple gets put into econtext->ecxt_outertuple (resp. ecxt_innertuple) at runtime, but that seems like no big problem. In both setrefs.c and ruleutils.c, it would be desirable to have a TargetEntry list somewhere representing the index columns, which setrefs would want so it could set up the special Var nodes with fix_upper_expr, and ruleutils would want so it could interpret the Vars using existing machinery. I'm not sure whether to hang that list on the index-only plan node or expect EXPLAIN to regenerate it at need. 3. Invent another special varno value similar to OUTER/INNER but representing an index reference. This is just about like #2 except that we could still put the index tuple into econtext->ecxt_scantuple, and ExecEvalVar would do the right thing as it stands. 4. Create a rangetable entry specifically representing the index, and set varno equal to that RTE's number. This has some attractiveness in terms of making the meaning of the Vars clear, but an RTE that represents an index rather than a table seems kind of ugly otherwise. It would likely require changes in unrelated parts of the code. One point here is that we have historically used solution #1 to represent the index keys in index qual expressions. We avoid the ambiguity issues by not asking EXPLAIN to try to interpret the indexqual tree at all: it works from indexqualorig which contains ordinary Vars. So one way to dodge the disadvantages of solution #1 would be to add untransformed "targetlistorig" and "qualorig" fields to an index-only plan node, and use those for EXPLAIN. However, those fields would be totally dead weight if the plan were never EXPLAINed, whereas indexqualorig has a legitimate use for rechecking indexquals against the heap tuple in case of a lossy index. (BTW, if we go with any solution other than #1, I'm strongly inclined to change the representation of indexqual to match. See the comments in fix_indexqual_operand.) At the moment I'm leaning to approach #3, but I wonder if anyone has a different opinion or another idea altogether. regards, tom lane
On Sun, Oct 9, 2011 at 9:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > At the moment I'm leaning to approach #3, but I wonder if anyone has > a different opinion or another idea altogether. > Would any of these make it more realistic to talk about the crazy plans Heikki suggested like doing two index scans, doing the join between the index tuples, and only then looking up the visibility information and remaining columns for the tuple on the matching rows? -- greg
Greg Stark <stark@mit.edu> writes: > On Sun, Oct 9, 2011 at 9:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> At the moment I'm leaning to approach #3, but I wonder if anyone has >> a different opinion or another idea altogether. > Would any of these make it more realistic to talk about the crazy > plans Heikki suggested like doing two index scans, doing the join > between the index tuples, and only then looking up the visibility > information and remaining columns for the tuple on the matching rows? I don't think it's particularly relevant --- we would not want to use weird representations of the Vars outside the index scan nodes. Above the scan they'd be just like any other upper-level Vars. (FWIW, that idea isn't crazy; I remember having discussions of it back in 2003 or so.) regards, tom lane
On Sun, Oct 9, 2011 at 10:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I don't think it's particularly relevant --- we would not want to use > weird representations of the Vars outside the index scan nodes. Above > the scan they'd be just like any other upper-level Vars. I can't say I fully understand the planner data structures and the implications of the options. I guess what I was imagining was that being able to reference the indexes as regular rangetable entries would make it more convenient for the rest of the planner to keep working as if nothing had changed when working with values extracted from index tuples. -- greg
I wrote: > There are at least four things we could do: ... > 2. Set varno = OUTER (or maybe INNER). This is safe because there's no > other use for OUTER/INNER in a table scan node. We would have to hack > things so that the index tuple gets put into econtext->ecxt_outertuple > (resp. ecxt_innertuple) at runtime, but that seems like no big problem. > In both setrefs.c and ruleutils.c, it would be desirable to have a > TargetEntry list somewhere representing the index columns, which setrefs > would want so it could set up the special Var nodes with fix_upper_expr, > and ruleutils would want so it could interpret the Vars using existing > machinery. I'm not sure whether to hang that list on the index-only > plan node or expect EXPLAIN to regenerate it at need. > 3. Invent another special varno value similar to OUTER/INNER but > representing an index reference. This is just about like #2 except that > we could still put the index tuple into econtext->ecxt_scantuple, and > ExecEvalVar would do the right thing as it stands. I have mostly-working code for approach #3, but I haven't tried to make EXPLAIN work yet. While looking at that I realized that there's a pretty good argument for adding the above-mentioned explicit TargetEntry list representing the index columns to index-only plan nodes. Namely, that if we don't do it, EXPLAIN will have to go to the catalogs to find out what's in that index, and this will fall down for "hypothetical" indexes injected into the planner by index advisors. We could imagine adding some more hooks to let the advisor inject bogus catalog data at EXPLAIN time, but on the whole it seems easier and less fragile to just have the planner include a data structure it has to build anyway into the finished plan. The need for this additional node list field also sways me in a direction that I'd previously been on the fence about, namely that I think index-only scans need to be their own independent plan node type instead of sharing a node type with regular indexscans. It's just too weird that a simple boolean indexonly property would mean completely different contents/interpretation of the tlist and quals. I've run into one other thing that's going to need to be hacked up a bit: index-only scans on "name" columns fall over with this modified code, because there's now tighter checking of the implied tuple descriptors: regression=# select relname from pg_class where relname = 'tenk1'; ERROR: attribute 1 has wrong type DETAIL: Table has type cstring, but query expects name. The reason for this is the hack we put in some time back to conserve space in system catalog indexes by having "name" columns be indexed as though they were "cstring", cf commit 5f6f840e93a3649e0d07e85bad188d163e96ec0e. We will probably need some compensatory hack in index-only scans, unless we can think of a less klugy way of representing that optimization. (Basically, the index-only code is assuming that btrees don't have storage type distinct from input type, and that's not the case for the name opclass. I had kind of expected the original patch to have some issues with that too, and I'm still not fully convinced that there aren't corner cases where it'd be an issue even with the currently committed code.) regards, tom lane
On Mon, Oct 10, 2011 at 2:47 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > The need for this additional node list field also sways me in a > direction that I'd previously been on the fence about, namely that > I think index-only scans need to be their own independent plan node type > instead of sharing a node type with regular indexscans At a superficial PR level it'll go over quite well to have a special plan node be visible in the explain output. People will love to see Fast Index Scan or Covering Index Scan or whatever you call it in their plans. -- greg
I wrote: > I have mostly-working code for approach #3, but I haven't tried to make > EXPLAIN work yet. While looking at that I realized that there's a > pretty good argument for adding the above-mentioned explicit TargetEntry > list representing the index columns to index-only plan nodes. Namely, > that if we don't do it, EXPLAIN will have to go to the catalogs to find > out what's in that index, and this will fall down for "hypothetical" > indexes injected into the planner by index advisors. We could imagine > adding some more hooks to let the advisor inject bogus catalog data at > EXPLAIN time, but on the whole it seems easier and less fragile to just > have the planner include a data structure it has to build anyway into > the finished plan. > The need for this additional node list field also sways me in a > direction that I'd previously been on the fence about, namely that > I think index-only scans need to be their own independent plan node type > instead of sharing a node type with regular indexscans. It's just too > weird that a simple boolean indexonly property would mean completely > different contents/interpretation of the tlist and quals. Attached is a draft patch for this. It needs some more review before committing, but it does pass regression tests now. One point worth commenting on is that I chose to rename the OUTER and INNER symbols for special varnos to OUTER_VAR and INNER_VAR, along with adding a new special varno INDEX_VAR. It's bothered me for some time that those macro names were way too generic/susceptible to collision; and since I had to look at all the uses anyway to see if the INDEX case needed to be added, this seemed like a good time to rename them. regards, tom lane
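A rough sketch of the renamed sentinel varnos described above; the numeric values shown are assumptions for illustration, not taken from the patch:

    /* special varno values used in finished plan trees (sketch; values illustrative) */
    #define INNER_VAR   65000   /* reference to the inner subplan's tuple */
    #define OUTER_VAR   65001   /* reference to the outer subplan's tuple */
    #define INDEX_VAR   65002   /* reference to an index column, for index-only scans */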
On Tue, Oct 11, 2011 at 12:19 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I wrote: >> I have mostly-working code for approach #3, but I haven't tried to make >> EXPLAIN work yet. While looking at that I realized that there's a >> pretty good argument for adding the above-mentioned explicit TargetEntry >> list representing the index columns to index-only plan nodes. Namely, >> that if we don't do it, EXPLAIN will have to go to the catalogs to find >> out what's in that index, and this will fall down for "hypothetical" >> indexes injected into the planner by index advisors. We could imagine >> adding some more hooks to let the advisor inject bogus catalog data at >> EXPLAIN time, but on the whole it seems easier and less fragile to just >> have the planner include a data structure it has to build anyway into >> the finished plan. > >> The need for this additional node list field also sways me in a >> direction that I'd previously been on the fence about, namely that >> I think index-only scans need to be their own independent plan node type >> instead of sharing a node type with regular indexscans. It's just too >> weird that a simple boolean indexonly property would mean completely >> different contents/interpretation of the tlist and quals. > > Attached is a draft patch for this. It needs some more review before > committing, but it does pass regression tests now. > > One point worth commenting on is that I chose to rename the OUTER and > INNER symbols for special varnos to OUTER_VAR and INNER_VAR, along with > adding a new special varno INDEX_VAR. It's bothered me for some time > that those macro names were way too generic/susceptible to collision; > and since I had to look at all the uses anyway to see if the INDEX case > needed to be added, this seemed like a good time to rename them. +1 for that renaming, for sure. Have you given any thought to what would be required to support index-only scans on non-btree indexes - particularly, GIST? ISTM we might have had a thought that some GIST opclasses but not others would be able to regurgitate tuples, or maybe even that it might vary from index tuple to index tuple. But that discussion was a long time ago, and my memory is fuzzy. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Oct 11, 2011 at 2:36 PM, Robert Haas <robertmhaas@gmail.com> wrote: > Have you given any thought to what would be required to support > index-only scans on non-btree indexes - particularly, GIST? ISTM we > might have had a thought that some GIST opclasses but not others would > be able to regurgitate tuples, or maybe even that it might vary from > index tuple to index tuple. But that discussion was a long time ago, > and my memory is fuzzy. In some GiST opclasses the original tuple can't be restored from the index tuple. For example, a polygon can't be restored from its MBR. In other opclasses, for example box_ops and point_ops, the original tuple can easily be restored from the index tuple. I can't remember any implementation where this possibility can vary from index tuple to index tuple. The GiST opclass for tsvector has a different representation of the leaf index tuple depending on its length, but at most it stores an array of hashes of words, so it's lossy anyway. I think the GiST interface could be extended with an optional function which restores the original tuple. But there is some additional difficulty when some opclasses of a composite index provide this function and others do not. -- With best regards, Alexander Korotkov.
Robert Haas <robertmhaas@gmail.com> writes: > Have you given any thought to what would be required to support > index-only scans on non-btree indexes - particularly, GIST? ISTM we > might have had a thought that some GIST opclasses but not others would > be able to regurgitate tuples, or maybe even that it might vary from > index tuple to index tuple. But that discussion was a long time ago, > and my memory is fuzzy. It would have to be a per-opclass property, for sure, since the basic criterion is whether the opclass's "compress" function is lossy. I don't think we can tolerate a situation where some of the index tuples might be able to yield data and others not --- that would undo all the improvements I've been working on over the past few days. I haven't thought as far ahead as how we might get the information needed for a per-opclass flag. A syntax addition to CREATE OPERATOR CLASS might be the only way. regards, tom lane
On Tue, Oct 11, 2011 at 5:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I haven't thought as far ahead as how we might get the information > needed for a per-opclass flag. A syntax addition to CREATE OPERATOR > CLASS might be the only way. Shouldn't it be implemented through additional interface function? There are situations when restoring of original tuple requires some transformation. For example, in point_ops we store box in the leaf index tuple, while point can be easily restored from box. -- With best regards, Alexander Korotkov.
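As a hedged illustration of the point_ops case Alexander describes (this code is not from the thread): the opclass stores the point as a zero-area box in the leaf tuple, so "restoring" the original value is just a small transformation reading back one corner.

    #include "postgres.h"
    #include "utils/geo_decls.h"

    /* illustrative sketch: rebuild the indexed point from the stored box key */
    static Datum
    point_from_leaf_key(BOX *key)
    {
        Point  *p = (Point *) palloc(sizeof(Point));

        p->x = key->low.x;      /* low == high for a box built from a single point */
        p->y = key->low.y;
        return PointPGetDatum(p);
    }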
On Oct 7, 2011, at 8:47 PM, Joshua D. Drake wrote: > > On 10/07/2011 11:40 AM, Tom Lane wrote: >> Robert Haas<robertmhaas@gmail.com> writes: >>> Please find attached a patch implementing a basic version of >>> index-only scans. >> >> I'm making some progress with this, but I notice what seems like a >> missing feature: there needs to be a way to turn it off. Otherwise >> performance comparisons will be difficult to impossible. >> >> The most obvious solution is a planner control GUC, perhaps >> "enable_indexonlyscan". Anyone object, or want to bikeshed the name? > > enable_onlyindexscan > > I'm kidding. > > +1 on Tom's proposed name. +1 ... definitely an important thing to do. regards, hans -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt, Austria Web: http://www.postgresql-support.de
Alexander Korotkov <aekorotkov@gmail.com> writes: > On Tue, Oct 11, 2011 at 5:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I haven't thought as far ahead as how we might get the information >> needed for a per-opclass flag. A syntax addition to CREATE OPERATOR >> CLASS might be the only way. >> > Shouldn't it be implemented through additional interface function? There are > situations when restoring of original tuple requires some transformation. > For example, in point_ops we store box in the leaf index tuple, while point > can be easily restored from box. Hm. I had been supposing that lossless compress functions would just be no-ops. If that's not necessarily the case then we might need something different from the opclass's decompress function to get back the original data. However, that doesn't really solve the problem I'm concerned about, because the existence and use of such a function would be entirely internal to GiST. There still needs to be a way for the planner to know which opclasses support data retrieval. And I do *not* want to see us hard-wire "the presence of opclass function 8 means a GiST opclass can return data" into the planner. Maybe, instead of a simple constant amcanreturn column, we need an AM API function that says whether the index can return data. regards, tom lane
On Wed, Oct 12, 2011 at 12:35 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Hm. I had been supposing that lossless compress functions would just be > no-ops. If that's not necessarily the case then we might need something > different from the opclass's decompress function to get back the > original data. However, that doesn't really solve the problem I'm > concerned about, because the existence and use of such a function would > be entirely internal to GiST. There still needs to be a way for the > planner to know which opclasses support data retrieval. And I do *not* > want to see us hard-wire "the presence of opclass function 8 means a > GiST opclass can return data" into the planner. > > Maybe, instead of a simple constant amcanreturn column, we need an AM > API function that says whether the index can return data. I like idea of such AM API function. Since single multicolumn index can use multiple opclasses, AM API function should also say *what* data index can return. -- With best regards, Alexander Korotkov.
Tom Lane <tgl@sss.pgh.pa.us> writes: > I haven't thought as far ahead as how we might get the information > needed for a per-opclass flag. A syntax addition to CREATE OPERATOR > CLASS might be the only way. It looks to me like it's related to the RECHECK property. Maybe it's just too late, though. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
Alexander Korotkov <aekorotkov@gmail.com> writes: > On Wed, Oct 12, 2011 at 12:35 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Maybe, instead of a simple constant amcanreturn column, we need an AM >> API function that says whether the index can return data. > I like idea of such AM API function. Since single multicolumn index can use > multiple opclasses, AM API function should also say *what* data index can > return. I was thinking more like "amcanreturn(index, column_number) returns bool" which says if the index can return the data for that column. The AM would still have to return a full IndexTuple at runtime, but it'd be allowed to insert nulls or garbage for columns it hadn't promised to return. BTW, if we do this, I'm rather strongly tempted to get rid of the name-versus-cstring hack (see index_descriptor_hack() in HEAD) by defining btree name_ops as not capable of returning data. I don't trust that hack much at all. regards, tom lane
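In C terms, the per-column check Tom sketches would have roughly this shape; the name and argument list follow his description above, and whether the committed form matches is left open:

    /* sketch: can this index column's data be returned usable by the AM? */
    bool
    amcanreturn(Relation indexRelation, int column_number);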
I wrote: >> I was also toying with the notion of pushing the slot fill-in into the >> AM, so that the AM API is to return a filled TupleSlot not an >> IndexTuple. This would not save any cycles AFAICT --- at least in >> btree, we still have to make a palloc'd copy of the IndexTuple so that >> we can release lock on the index page. The point of it would be to >> avoid the assumption that the index's internal storage has exactly the >> format of IndexTuple. Don't know whether we'd ever have any actual use >> for that flexibility, but it seems like it wouldn't cost much to >> preserve the option. > BTW, I concluded that that would be a bad idea, because it would involve > the index AM in some choices that we're likely to want to change. In > particular it would foreclose ever doing anything with expression > indexes, without an AM API change. Better to just define the AM's > responsibility as to hand back a tuple defined according to the index's > columns. Although this aspect of the code is now working well enough for btree, I realized that it's going to have a problem if/when we add GiST support. The difficulty is that the index rowtype includes "storage" datatypes, not underlying-heap-column datatypes, for opclasses where those are different. This is not going to do for cases where we need to reconstruct a heap value from the index contents, as in Alexander's example of gist point_ops using a box as the underlying storage. What we actually want back from the index AM is a rowtype that matches the list of values submitted for indexing (ie, the original output of FormIndexDatum), and only for btree is it the case that that's represented more or less exactly as the IndexTuple stored in the index. So what I'm now thinking is to go back to the idea of having the index AM fill in a TupleTableSlot. For btree this would just amount to moving the existing StoreIndexTuple function into the AM. But it would give GiST the opportunity to do some computation, and it would avoid the problem of the index's rowtype not being a suitable intermediate format. The objection I voiced above is misguided, because it confuses the set of column types that's needed with the format distinction between a Slot and an IndexTuple. BTW, right at the moment I'm not that excited about actually doing any work on GiST itself for index-only scans. Given the current list of available opclasses there don't seem to be very many for which index-only scans would be possible, so the amount of work needed seems rather out of proportion to the benefit. But I don't mind fixing AM API details that are needed to make this workable. I'd rather have the API as right as possible in the first release. regards, tom lane
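A minimal sketch of the slot-filling step being discussed, i.e. decomposing the AM's output into a virtual tuple slot; the function name is invented here and the details are assumptions, not the committed code:

    /* fill a virtual tuple slot from an IndexTuple's columns */
    static void
    fill_slot_from_index_tuple(TupleTableSlot *slot, IndexTuple itup,
                               TupleDesc itupdesc)
    {
        int     i;

        ExecClearTuple(slot);
        for (i = 0; i < itupdesc->natts; i++)
            slot->tts_values[i] = index_getattr(itup, i + 1, itupdesc,
                                                &slot->tts_isnull[i]);
        ExecStoreVirtualTuple(slot);
    }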
On Fri, Oct 7, 2011 at 11:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Please find attached a patch implementing a basic version of >> index-only scans. > > I'm making some progress with this, but I notice what seems like a > missing feature: there needs to be a way to turn it off. Otherwise > performance comparisons will be difficult to impossible. > > The most obvious solution is a planner control GUC, perhaps > "enable_indexonlyscan". Anyone object, or want to bikeshed the name? > Currently I can't force an indexonlyscan over an indexscan, because of the way the enable_* variables work. Should setting enable_indexscan=off also disable index-only scans (the current behavior) in addition to plain index scans? By way of precedent, enable_indexscan=off does not disable Bitmap Index Scan. Cheers, Jeff
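The precedent Jeff cites can be checked with a session along these lines (illustrative only): with plain index scans disabled, a bitmap index scan can still be chosen, so by analogy enable_indexscan = off would not have to rule out index-only scans either.

    SET enable_indexscan = off;
    EXPLAIN SELECT * FROM pgbench_accounts WHERE aid = 42;
    -- a Bitmap Index Scan under a Bitmap Heap Scan remains available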
On Sat, Oct 15, 2011 at 2:16 PM, Jeff Janes <jeff.janes@gmail.com> wrote: > On Fri, Oct 7, 2011 at 11:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >>> Please find attached a patch implementing a basic version of >>> index-only scans. >> >> I'm making some progress with this, but I notice what seems like a >> missing feature: there needs to be a way to turn it off. Otherwise >> performance comparisons will be difficult to impossible. >> >> The most obvious solution is a planner control GUC, perhaps >> "enable_indexonlyscan". Anyone object, or want to bikeshed the name? >> > > Currently I can't force an indexonlyscan over an indexscan, because of > the way the enable_* variables work. OK, scratch that. I must have been using the wrong query (for which the index was not covering), as I can't reproduce the behavior nor looking at the code can I see how it could have occurred in the first place. Cheers, Jeff