Thread: Sequential scans
Hi,

I'm starting to review the "synchronized scans" and "scan-resistant buffer cache" patches. The patches have complex interactions so I'm taking a holistic approach.

There are four outstanding issues with the sync scans in particular:

1. The simplistic hash approach. While it's nice to not have a lock, I'm worried about collisions. If you had a collision every now and then, it wouldn't be that bad, but because the hash value is computed from the oid, a collision would be persistent. If you create a database and happen to have two frequently seqscanned tables that collide, the only way to get rid of the collision is to drop and recreate a table. Granted, that'd probably be very rare in practice, but when it happens it would be next to impossible to figure out what's going on.

Let's use a normal hash table instead, and use a lock to protect it. If we only update it every 10 pages or so, the overhead should be negligible. To further reduce contention, we could modify ReadBuffer to let the caller know if the read resulted in a physical read or not, and only update the entry when a page is physically read in. That way all the synchronized scanners wouldn't be updating the same value, just the one performing the I/O. And while we're at it, let's use the full relfilenode instead of just the table oid in the hash.

2. Under what circumstances does the patch help and when does it hurt? I think the patch is safe in that it should never be any worse than what we have now. But when does it help? That needs to be looked at together with the other patch. I need to dig through the archives for the performance test results you posted earlier and try to understand them.

There are six distinct scenarios I've come up with thus far that need to be looked at:

A. A seq scan on a small table
B. A seq scan on a table that's 110% the size of shared_buffers, but smaller than RAM
C. A seq scan on a table that's 110% the size of RAM
D. A seq scan on a huge table
E. Two simultaneous seq scans on a large table starting at the same time
F. Two simultaneous seq scans on a large table, the 2nd one starting when the 1st one is halfway through

Also, does it change things if you have a bulk update instead of a read-only query? How about bitmap heap scans and large index scans? And vacuums? And the above scenarios need to be considered both alone and in the presence of other OLTP-style workloads.

I realize that we can't have everything, and as long as we get some significant benefit in some scenarios, and don't hurt others, the patch is worthwhile. But let's try to cover as much as we reasonably can.

One random idea I had to cover B & C without having the offset variable: start scanning *backwards* from the page that's in the shared hash table, until you hit a page that's not in the buffer cache. Then you continue scanning forwards from the page you started from. This needs more thought, but I think we can come up with a pretty simple solution that covers the most common cases.

3. By having different backends doing the reads, are we destroying OS readahead as Tom suggested? I remember you performed some tests on that, and it was a problem on some systems but not on others. This needs some thought; there may be some simple way to address that.

4. It fails regression tests. You get an assertion failure on the portal test. I believe that changing the direction of a scan isn't handled properly; it's probably pretty easy to fix.

Jeff, could you please fix 1 and 4?
I'll give 2 and 3 some more thought, and take a closer look at the scan-resistant scans patch. Any comments and ideas are welcome, of course.. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
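As a rough sketch of what item 1 above describes — a shared-memory hash table keyed on the relation's RelFileNode, updated only every few pages under an LWLock — here is one way it could look. All names here (ss_scan_location, ScanLocationTable, SyncScanLock, SyncScanShmemInit) are invented for illustration and are not part of the patch; SyncScanLock would be a new lock, and the table sizing is the open question discussed below. ShmemInitHash and tag_hash are the existing dynahash facilities.

#include "postgres.h"
#include "storage/block.h"
#include "storage/lwlock.h"
#include "storage/relfilenode.h"
#include "storage/shmem.h"
#include "utils/hsearch.h"

/* One entry per relation currently being scanned sequentially. */
typedef struct ss_scan_location
{
    RelFileNode relfilenode;    /* hash key: tablespace, database, relation */
    BlockNumber location;       /* block number most recently reported */
} ss_scan_location;

static HTAB *ScanLocationTable;     /* lives in shared memory */

/* Called from shared-memory initialization at postmaster startup. */
void
SyncScanShmemInit(void)
{
    HASHCTL     info;

    MemSet(&info, 0, sizeof(info));
    info.keysize = sizeof(RelFileNode);
    info.entrysize = sizeof(ss_scan_location);
    info.hash = tag_hash;

    ScanLocationTable = ShmemInitHash("Sync Scan Locations",
                                      64, 64,   /* sizing is an open question */
                                      &info,
                                      HASH_ELEM | HASH_FUNCTION);
}

/*
 * Report the current position of a scan, but only every N pages so that
 * the lock traffic stays low.  SyncScanLock would be a new LWLock.
 */
#define SYNC_SCAN_REPORT_INTERVAL 10

void
ss_report_location(RelFileNode rnode, BlockNumber location)
{
    ss_scan_location *entry;
    bool        found;

    if ((location % SYNC_SCAN_REPORT_INTERVAL) != 0)
        return;

    LWLockAcquire(SyncScanLock, LW_EXCLUSIVE);
    entry = (ss_scan_location *) hash_search(ScanLocationTable,
                                             &rnode, HASH_ENTER, &found);
    entry->location = location;
    LWLockRelease(SyncScanLock);
}

A backend starting a new seq scan would do the reverse: a HASH_FIND under a shared lock to pick its starting block.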
On Wed, 2007-05-02 at 14:26 +0100, Heikki Linnakangas wrote: > Hi, > > I'm starting to review the "synchronized scans" and "scan-resistant > buffer cache" patches. The patches have complex interactions so I'm > taking a holistic approach. > > There's four outstanding issues with the sync scans in particular: > > 1. The simplistic hash approach. While it's nice to not have a lock, I'm > worried of collisions. If you had a collision every now and then, it > wouldn't be that bad, but because the hash value is computed from the > oid, a collision would be persistent. If you create a database and > happen to have two frequently seqscanned tables that collide, the only > way to get rid of the collision is to drop and recreate a table. > Granted, that'd probably be very rare in practice, but when it happens > it would be next to impossible to figure out what's going on. > > Let's use a normal hash table instead, and use a lock to protect it. If > we only update it every 10 pages or so, the overhead should be > negligible. To further reduce contention, we could modify ReadBuffer to > let the caller know if the read resulted in a physical read or not, and > only update the entry when a page is physically read in. That way all > the synchronized scanners wouldn't be updating the same value, just the > one performing the I/O. And while we're at it, let's use the full > relfilenode instead of just the table oid in the hash. What should be the maximum size of this hash table? Is there already-existing hash table code that I should use to be consistent with the rest of the code? I'm still trying to understand the effect of using the full relfilenode. Do you mean using the entire relation _segment_ as the key? That doesn't make sense to me. Or do you just mean using the relfilenode (without the segment) as the key? > 3. By having different backends doing the reads, are we destroying OS > readahead as Tom suggested? I remember you performed some tests on that, > and it was a problem on some systems but not on others. This needs some > thought, there may be some simple way to address that. Linux with the CFQ I/O scheduler performs very poorly and inconsistently with concurrent sequential scans regardless of whether the scans are synchronized or not. I suspect the reason for this is that CFQ is designed to care more about the process issuing the request than any other factor. Every other I/O system performed either ideally (no interference between scans) or had some interference but still much better than current behavior. Of course, my tests are limited and there are many possible combinations of I/O systems that I did not try. > 4. It fails regression tests. You get an assertion failure on the portal > test. I believe that changing the direction of a scan isn't handled > properly; it's probably pretty easy to fix. > I will examine the code more carefully. As a first guess, is it possible that the test is failing because of the non-deterministic order in which tuples are returned? > Jeff, could you please fix 1 and 4? I'll give 2 and 3 some more thought, > and take a closer look at the scan-resistant scans patch. Any comments > and ideas are welcome, of course.. > Yes. I'll also try to address the other issues in your email. Thanks for your comments. Regards, Jeff Davis
"Heikki Linnakangas" <heikki@enterprisedb.com> writes: > Let's use a normal hash table instead, and use a lock to protect it. If we only > update it every 10 pages or so, the overhead should be negligible. To further > reduce contention, we could modify ReadBuffer to let the caller know if the > read resulted in a physical read or not, and only update the entry when a page > is physically read in. That way all the synchronized scanners wouldn't be > updating the same value, just the one performing the I/O. And while we're at > it, let's use the full relfilenode instead of just the table oid in the hash. It's probably fine to just do that. But if we find it's a performance bottleneck we could probably still manage to avoid the lock except when actually inserting a new hash element. If you just store in the hash an index into an array stored in global memory then you could get away without a lock on the element in the array. It starts to get to be a fair amount of code when you think about how you would reuse elements of the array. That's why I suggest only looking at this if down the road we find that it's a bottleneck. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Gregory Stark wrote: > "Heikki Linnakangas" <heikki@enterprisedb.com> writes: > >> Let's use a normal hash table instead, and use a lock to protect it. If we only >> update it every 10 pages or so, the overhead should be negligible. To further >> reduce contention, we could modify ReadBuffer to let the caller know if the >> read resulted in a physical read or not, and only update the entry when a page >> is physically read in. That way all the synchronized scanners wouldn't be >> updating the same value, just the one performing the I/O. And while we're at >> it, let's use the full relfilenode instead of just the table oid in the hash. > > It's probably fine to just do that. But if we find it's a performance > bottleneck we could probably still manage to avoid the lock except when > actually inserting a new hash element. If you just store in the hash an index > into an array stored in global memory then you could get away without a lock > on the element in the array. > > It starts to get to be a fair amount of code when you think about how you > would reuse elements of the array. That's why I suggest only looking at this > if down the road we find that it's a bottleneck. Another trick you could do is to acquire the lock conditionally when updating it. But I doubt it's a problem anyhow, if we put some sane lower limit in there so that it's not used at all for small tables. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
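To illustrate the conditional-acquire idea: if the lock happens to be busy, the scan simply skips this update rather than stalling, and tries again a few pages later. This reuses the hypothetical ScanLocationTable and SyncScanLock names from the sketch above; LWLockConditionalAcquire is the existing primitive that returns false instead of blocking.

/*
 * Variant of the reporting step using a conditional acquire: if some other
 * backend holds the lock, skip this update instead of waiting for it.
 */
void
ss_report_location_conditional(RelFileNode rnode, BlockNumber location)
{
    if (LWLockConditionalAcquire(SyncScanLock, LW_EXCLUSIVE))
    {
        ss_scan_location *entry;
        bool        found;

        entry = (ss_scan_location *) hash_search(ScanLocationTable,
                                                 &rnode, HASH_ENTER, &found);
        entry->location = location;
        LWLockRelease(SyncScanLock);
    }
    /* else: we'll get another chance to report in a few pages anyway */
}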
Jeff Davis wrote: > On Wed, 2007-05-02 at 14:26 +0100, Heikki Linnakangas wrote: >> Let's use a normal hash table instead, and use a lock to protect it. If >> we only update it every 10 pages or so, the overhead should be >> negligible. To further reduce contention, we could modify ReadBuffer to >> let the caller know if the read resulted in a physical read or not, and >> only update the entry when a page is physically read in. That way all >> the synchronized scanners wouldn't be updating the same value, just the >> one performing the I/O. And while we're at it, let's use the full >> relfilenode instead of just the table oid in the hash. > > What should be the maximum size of this hash table? Good question. And also, how do you remove entries from it? I guess the size should somehow be related to number of backends. Each backend will realistically be doing just 1 or max 2 seq scan at a time. It also depends on the number of large tables in the databases, but we don't have that information easily available. How about using just NBackends? That should be plenty, but wasting a few hundred bytes of memory won't hurt anyone. I think you're going to need an LRU list and counter of used entries in addition to the hash table, and when all entries are in use, remove the least recently used one. The thing to keep an eye on is that it doesn't add too much overhead or lock contention in the typical case when there's no concurrent scans. For the locking, use a LWLock. > Is there already- existing hash table code that I should use to be consistent with the > rest of the code? Yes, see utils/hash/dynahash.c, and ShmemInitHash (in storage/ipc/shmem.c) since it's in shared memory. There are plenty of examples that use hash tables, see for example storage/freespace/freespace.c. > I'm still trying to understand the effect of using the full relfilenode. > Do you mean using the entire relation _segment_ as the key? That doesn't > make sense to me. Or do you just mean using the relfilenode (without the > segment) as the key? No, not the segment. RelFileNode consists of tablespace oid, database oid and relation oid. You can find it in scan->rs_rd->rd_node. The segmentation works at a lower level. > Linux with CFQ I/O scheduler performs very poorly and inconsistently > with concurrent sequential scans regardless of whether the scans are > synchronized or not. I suspect the reason for this is that CFQ is > designed to care more about the process issuing the request than any > other factor. > > Every other I/O system performed either ideally (no interference between > scans) or had some interference but still much better than current > behavior. Hmm. Should we care then? CFQ is the default on Linux, and an average sysadmin is unlikely to change it. What we could do quite easily is: - when ReadBuffer is called, let the caller know if the read did physical I/O. - when the previous ReadBuffer didn't result in physical I/O, assume that we're not the pack leader. If the next buffer isn't already in cache, wait a few milliseconds before initiating the read, giving the pack leader a chance to do it instead. Needs testing, of course.. >> 4. It fails regression tests. You get an assertion failure on the portal >> test. I believe that changing the direction of a scan isn't handled >> properly; it's probably pretty easy to fix. >> > > I will examine the code more carefully. As a first guess, is it possible > that test is failing because of the non-deterministic order in which > tuples are returned? 
No, it's an assertion failure, not just different output than expected. But it's probably quite simple to fix.. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
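A rough sketch of the "pack leader" throttling described above. It assumes a hypothetical flag telling us whether our previous read required physical I/O — which is exactly the ReadBuffer change being proposed, so the flag is simply passed in here — and the 2 ms delay is an arbitrary placeholder.

#include "postgres.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

static Buffer
ss_read_next_block(Relation rel, BlockNumber blocknum,
                   bool prev_read_was_physical)
{
    /*
     * If our previous read was satisfied from the buffer cache, another
     * backend is presumably leading the pack and doing the I/O for us.
     * Give it a small head start before issuing this read ourselves.
     * (A real implementation would also want to check whether the block
     * is already cached before deciding to sleep.)
     */
    if (!prev_read_was_physical)
        pg_usleep(2000L);       /* ~2 ms; pure guesswork, needs testing */

    return ReadBuffer(rel, blocknum);
}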
On Wed, 2007-05-02 at 20:02 +0100, Heikki Linnakangas wrote: > Gregory Stark wrote: > > "Heikki Linnakangas" <heikki@enterprisedb.com> writes: > > > >> Let's use a normal hash table instead, and use a lock to protect it. If we only > >> update it every 10 pages or so, the overhead should be negligible. To further > >> reduce contention, we could modify ReadBuffer to let the caller know if the > >> read resulted in a physical read or not, and only update the entry when a page > >> is physically read in. That way all the synchronized scanners wouldn't be > >> updating the same value, just the one performing the I/O. And while we're at > >> it, let's use the full relfilenode instead of just the table oid in the hash. > > > > It's probably fine to just do that. But if we find it's a performance > > bottleneck we could probably still manage to avoid the lock except when > > actually inserting a new hash element. If you just store in the hash an index > > into an array stored in global memory then you could get away without a lock > > on the element in the array. > > > > It starts to get to be a fair amount of code when you think about how you > > would reuse elements of the array. That's why I suggest only looking at this > > if down the road we find that it's a bottleneck. > > Another trick you could do is to use acquire the lock conditionally when > updating it. But I doubt it's a problem anyhow, if we put some sane > lower limit in there so that it's not used at all for small tables. > The more sophisticated the data structure, the less able we are to avoid locking, correct? For instance, if we have an LRU list it might be tricky or impossible to avoid locking even on just reads. Regards, Jeff Davis
Jeff Davis wrote: > On Wed, 2007-05-02 at 20:02 +0100, Heikki Linnakangas wrote: >> Gregory Stark wrote: >>> "Heikki Linnakangas" <heikki@enterprisedb.com> writes: >>> >>>> Let's use a normal hash table instead, and use a lock to protect it. If we only >>>> update it every 10 pages or so, the overhead should be negligible. To further >>>> reduce contention, we could modify ReadBuffer to let the caller know if the >>>> read resulted in a physical read or not, and only update the entry when a page >>>> is physically read in. That way all the synchronized scanners wouldn't be >>>> updating the same value, just the one performing the I/O. And while we're at >>>> it, let's use the full relfilenode instead of just the table oid in the hash. >>> It's probably fine to just do that. But if we find it's a performance >>> bottleneck we could probably still manage to avoid the lock except when >>> actually inserting a new hash element. If you just store in the hash an index >>> into an array stored in global memory then you could get away without a lock >>> on the element in the array. >>> >>> It starts to get to be a fair amount of code when you think about how you >>> would reuse elements of the array. That's why I suggest only looking at this >>> if down the road we find that it's a bottleneck. >> Another trick you could do is to use acquire the lock conditionally when >> updating it. But I doubt it's a problem anyhow, if we put some sane >> lower limit in there so that it's not used at all for small tables. >> > > The more sophisticated the data structure the less able we are to avoid > locking, correct? For instance, if we have an LRU list it might be > tricky or impossible to avoid locking even on just reads. Agreed. I'm not concerned about reads, though. You only need to read from the structure once when you start a scan. It's the updates that cause most of the traffic. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, 2007-05-02 at 20:58 +0100, Heikki Linnakangas wrote: > Jeff Davis wrote: > > What should be the maximum size of this hash table? > > Good question. And also, how do you remove entries from it? > > I guess the size should somehow be related to number of backends. Each > backend will realistically be doing just 1 or max 2 seq scan at a time. > It also depends on the number of large tables in the databases, but we > don't have that information easily available. How about using just > NBackends? That should be plenty, but wasting a few hundred bytes of > memory won't hurt anyone. One entry per relation, not per backend, is my current design. > I think you're going to need an LRU list and counter of used entries in > addition to the hash table, and when all entries are in use, remove the > least recently used one. > > The thing to keep an eye on is that it doesn't add too much overhead or > lock contention in the typical case when there's no concurrent scans. > > For the locking, use a LWLock. > Ok. What would be the potential lock contention in the case of no concurrent scans? Also, is it easy to determine the space used by a dynahash with N entries? I haven't looked at the dynahash code yet, so perhaps this will be obvious. > No, not the segment. RelFileNode consists of tablespace oid, database > oid and relation oid. You can find it in scan->rs_rd->rd_node. The > segmentation works at a lower level. Ok, will do. > Hmm. Should we care then? CFQ is the default on Linux, and an average > sysadmin is unlikely to change it. > Keep in mind that concurrent sequential scans with CFQ *already* perform very poorly. I think that alone is an interesting fact that's somewhat independent of Sync Scans. > - when ReadBuffer is called, let the caller know if the read did > physical I/O. > - when the previous ReadBuffer didn't result in physical I/O, assume > that we're not the pack leader. If the next buffer isn't already in > cache, wait a few milliseconds before initiating the read, giving the > pack leader a chance to do it instead. > > Needs testing, of course.. > An interesting idea. I like that one the most out of the ideas for maintaining a "pack leader". That's very similar to what the Linux anticipatory scheduler does for us. > >> 4. It fails regression tests. You get an assertion failure on the portal > >> test. I believe that changing the direction of a scan isn't handled > >> properly; it's probably pretty easy to fix. > >> > > > > I will examine the code more carefully. As a first guess, is it possible > > that test is failing because of the non-deterministic order in which > > tuples are returned? > > No, it's an assertion failure, not just different output than expected. > But it's probably quite simple to fix.. > Ok, I'll find and correct it then. Regards, Jeff Davis
Jeff Davis wrote: > On Wed, 2007-05-02 at 20:58 +0100, Heikki Linnakangas wrote: >> Jeff Davis wrote: >>> What should be the maximum size of this hash table? >> Good question. And also, how do you remove entries from it? >> >> I guess the size should somehow be related to number of backends. Each >> backend will realistically be doing just 1 or max 2 seq scan at a time. >> It also depends on the number of large tables in the databases, but we >> don't have that information easily available. How about using just >> NBackends? That should be plenty, but wasting a few hundred bytes of >> memory won't hurt anyone. > > One entry per relation, not per backend, is my current design. Umm, you naturally have just one entry per relation, but we were talking about how many entries the table needs to hold.. Your patch had a hard-coded value of 1000 which is quite arbitrary. >> I think you're going to need an LRU list and counter of used entries in >> addition to the hash table, and when all entries are in use, remove the >> least recently used one. >> >> The thing to keep an eye on is that it doesn't add too much overhead or >> lock contention in the typical case when there's no concurrent scans. >> >> For the locking, use a LWLock. > > Ok. What would be the potential lock contention in the case of no > concurrent scans? If you only have one seq scan in the system, there's no contention. If you have more, they will all need to acquire the lock to update the page number in the hash table, regardless of the table they're scanning, so there's potential for contention. To put things into perspective, though, any time you need to evict a buffer from the buffer cache to read in a new buffer, you need to acquire the BufFreelistLock. If we only update the page number say every 10 pages, the overhead and lock contention of that lock would be roughly 1/10th of that arising from BufFreelistLock. And we can make it a conditional acquire, in which case the backends won't need to slow down their scans, but the updates of the entries in the hash table will get less frequent instead. We could also take just a shared lock to update the counter, and rely on the atomicity of the write like your patch did. You'd only need to take an exclusive lock to move entries in the LRU list or to add or remove an entry. But let's keep it simple at first, and do some testing.. > Also, is it easy to determine the space used by a dynahash with N > entries? I haven't looked at the dynahash code yet, so perhaps this will > be obvious. Yes, see hash_estimate_size. BTW: Thanks for the patch! -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
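For the sizing question, a sketch of how hash_estimate_size() could be used to reserve the shared memory, capping the table at one entry per backend as discussed. MaxBackends is the existing global; SyncScanShmemSize and ss_scan_location are the hypothetical names from the earlier sketch.

#include "postgres.h"
#include "miscadmin.h"          /* MaxBackends */
#include "utils/hsearch.h"

Size
SyncScanShmemSize(void)
{
    /* One entry per backend is a generous upper bound on concurrent scans. */
    return hash_estimate_size(MaxBackends, sizeof(ss_scan_location));
}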
On Wed, 2007-05-02 at 23:59 +0100, Heikki Linnakangas wrote: > Jeff Davis wrote: > > On Wed, 2007-05-02 at 20:58 +0100, Heikki Linnakangas wrote: > >> Jeff Davis wrote: > >>> What should be the maximum size of this hash table? > >> Good question. And also, how do you remove entries from it? > >> > >> I guess the size should somehow be related to number of backends. Each > >> backend will realistically be doing just 1 or max 2 seq scan at a time. > >> It also depends on the number of large tables in the databases, but we > >> don't have that information easily available. How about using just > >> NBackends? That should be plenty, but wasting a few hundred bytes of > >> memory won't hurt anyone. > > > > One entry per relation, not per backend, is my current design. > > Umm, you naturally have just entry per relation, but we were talking > about how many entries the table needs to hold.. You're patch had a > hard-coded value of 1000 which is quite arbitrary. I think I'm missing something from your statement "I guess the size should somehow be related to number of backends." If there is only one entry per relation, why do more backends require a larger hash table? > If you only have one seq scan in the system, there's no contention. If > you have more, they will all need to acquire the lock to update the page > number in the hash table, regardless of the table their scanning, so > there's potential for contention. > > To put things into perspective, though, any time you need to evict a > buffer from the buffer cache to read in a new buffer, you need to > acquire the BufFreelistLock. If we only update the page number say every > 10 pages, the overhead and lock contention of that lock would be roughly > 1/10th of that arising from BufFreelistLock. And we can make it a > conditional acquire, in which case the backends won't need to slow down > their scans, but the updates of the entries in the hash table will get > less frequent instead. We could also take just a shared lock to update > the counter, and rely on the atomicity of the write like your patch did. > You'd only need to take an exclusive lock to move entries in the LRU > list or to add or remove an entry. > If I just step back for a second to consider a simpler design: We will most likely have no more than a few relations larger than the minimum threshold for Sync Scanning in any database being scanned concurrently. What if I just make an LRU that occupies a fixed size? Any reads or updates can start at the MRU item and work backward, and any evictions can happen at the LRU item. Does a hash table really save us cycles if we keep the list small, say, 100 items? Looking at all 100 items would only be required perhaps on a scan startup. I don't have a good sense of scale here; is following 50 pointers while holding a lock a significant source of contention? 100 is of course arbitrary, and that could grow or shrink automatically at runtime. > Yes, see hash_estimate_size. > Ok. Regards, Jeff Davis
On Wed, 2007-05-02 at 23:59 +0100, Heikki Linnakangas wrote: > Umm, you naturally have just entry per relation, but we were talking > about how many entries the table needs to hold.. You're patch had a > hard-coded value of 1000 which is quite arbitrary. We need to think of the interaction with partitioning here. People will ask whether we would recommend that individual partitions of a large table should be larger/smaller than a particular size, to allow these optimizations to kick in. My thinking is that database designers would attempt to set partition size larger than the sync scan limit, whatever it is. That means: - they wouldn't want the limit to vary when cache increases, so we *do* need a GUC to control the limit. My suggestion now would be large_scan_threshold, since it affects both caching and synch scans. - so there will be lots of partitions, so a hardcoded limit of 1000 would not be sufficient. A new GUC, or a link to an existing one, is probably required. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
Jeff Davis wrote: > On Wed, 2007-05-02 at 23:59 +0100, Heikki Linnakangas wrote: >> Jeff Davis wrote: >>> On Wed, 2007-05-02 at 20:58 +0100, Heikki Linnakangas wrote: >>>> Jeff Davis wrote: >>>>> What should be the maximum size of this hash table? >>>> Good question. And also, how do you remove entries from it? >>>> >>>> I guess the size should somehow be related to number of backends. Each >>>> backend will realistically be doing just 1 or max 2 seq scan at a time. >>>> It also depends on the number of large tables in the databases, but we >>>> don't have that information easily available. How about using just >>>> NBackends? That should be plenty, but wasting a few hundred bytes of >>>> memory won't hurt anyone. >>> One entry per relation, not per backend, is my current design. >> Umm, you naturally have just entry per relation, but we were talking >> about how many entries the table needs to hold.. You're patch had a >> hard-coded value of 1000 which is quite arbitrary. > > I think I'm missing something from your statement "I guess the size > should somehow be related to number of backends." If there is only one > entry per relation, why do more backends require a larger hash table? The hash table keeps track of ongoing seq scans. That's presumably related to the number of backends; I can't imagine a plan where a backend is executing more than one seq scan at a time, so that gives us an upper bound of NBackends simultaneous seq scans in the system. Sure, if you only have say 5 tables that are large enough that we try to keep track of them, there can't be more than 5 entries in the table. But we don't know how many such tables there are at startup time. A hard-coded constant in the range of 10-100 might be just fine in practice. > If I just step back for a second to consider a simpler design: > > We will most likely have no more than a few relations larger than the > minimum threshold for Sync Scanning in any database being scanned > concurrently. Note that the list would be shared across all databases in the cluster. Your point is still valid, though. > What if I just make an LRU that occupies a fixed size? Any reads or > updates can start at the MRU item and work backward, and any evictions > can happen at the LRU item. > > Does a hash table really save us cycles if we keep the list small, say, > 100 items? Looking at all 100 items would only be required perhaps on a > scan startup. I don't have a good sense of scale here, is following 50 > pointers while holding a lock a significant source of contention? Hmm. If you only need to scan through it when you start the scan, I suppose it would be ok. > 100 is of course arbitrary, and that could grow or shrink automatically > at runtime. Except that it can't shrink, because we don't have any convenient moment to remove an entry from the list, other than dropping the least recently used entry out of the list when inserting a new one. And since it's in shared mem, the allocated size is fixed at boot up, so we can't grow it either. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Simon Riggs wrote: > We need to think of the interaction with partitioning here. People will > ask whether we would recommend that individual partitions of a large > table should be larger/smaller than a particular size, to allow these > optimizations to kick in. > > My thinking is that database designers would attempt to set partition > size larger than the sync scan limit, whatever it is. That means: > - they wouldn't want the limit to vary when cache increases, so we *do* > need a GUC to control the limit. My suggestion now would be > large_scan_threshold, since it effects both caching and synch scans. They wouldn't? If you add more memory to your box, so that a table that didn't fit in memory before does now fit, surely you want to switch your strategy from "don't pollute the cache because it won't fit anyway" to "let's keep it in cache, so the next scan won't do I/O". The basic problem with partitions is that if you have a query like "SELECT * FROM partitioned_table", so that you seq scan multiple partitions, the size of each partition alone could be below the threshold, whatever that is, but since you're scanning them all the net result is the same as scanning one large table above the threshold. The same could happen in any plan that does multiple seq scans. It could even happen at the application level if the application submits multiple statements like "SELECT * FROM table1", "SELECT * FROM table2" one after the other. One way to address that would be to manage the recycle buffer ring size dynamically. When a backend gets a cache miss, the ring would shrink, and when it gets cache hits, it would grow. That way, if you do a seq scan on a table that fits in cache repeatedly, that table would get more buffers from the cache on each iteration until it's completely in cache. But if the table or tables being scanned are too big to fit in cache, the ring would stay small and not pollute the cache much. I'm not going to implement that for now; I'm seeing some scary negative feedback behavior with that, and it'd need a lot of testing anyway. I'm thinking of just using shared_buffers as the limit. One could argue for effective_cache_size as well. > - so there will be lots of partitions, so a hardcoded limit of 1000 > would not be sufficient. A new GUC, or a link to an existing one, is > probably required. No matter how many partitions you have, each backend could still be scanning only one of them at a time. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
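For concreteness, a sketch of the dynamic ring-sizing idea described above, which is explicitly not being implemented; every name and constant here is made up. The ring grows on buffer-cache hits and shrinks on misses, within fixed bounds, so a table that fits in cache gradually becomes fully cached while scans of larger tables keep using a small ring.

#include "postgres.h"
#include "storage/bufmgr.h"     /* NBuffers */

#define RING_SIZE_MIN   32                  /* arbitrary */
#define RING_SIZE_MAX   (NBuffers / 8)      /* arbitrary */

static int  seqscan_ring_size = RING_SIZE_MIN;

/* Call after each page is read during a large sequential scan. */
static void
ring_adjust(bool was_cache_hit)
{
    if (was_cache_hit)
        seqscan_ring_size = Min(seqscan_ring_size + 1, RING_SIZE_MAX);
    else
        seqscan_ring_size = Max(seqscan_ring_size - 1, RING_SIZE_MIN);
}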
On Thu, 2007-05-03 at 09:25 +0100, Heikki Linnakangas wrote: > The hash table keeps track of ongoing seq scans. That's presumably > related to number of backends; I can't imagine a plan where a backend is > executing more than one seq scan at a time, so that gives us an upper > bound of NBackends simultaneous seq scans in the system. > What I was trying to say before is that, in my design, it keeps track of relations being scanned, not scans happening. So 5 backends scanning the same table would result in one entry that consists of the most recently read block by any of those backends for that relation. Part of the reason I did this is because, with a Sync Scan, it seems like there may be possible reasons to do multiple seq scans concurrently in the same backend. This is completely contrived, but consider: SELECT * FROM bigtable UNION ALL SELECT * FROM bigtable; I'm not trying to argue for such a plan to exist right now, but if there is a possibility that we'd want to create such plans in the future, I think it's better to track by relation rather than by backend. There is really no reason that I can see to track 5 positions of 5 different backends scanning the same relation. It introduces a lot of new questions, such as: (1) When starting a new scan on relation R, with which backend do we choose to synchronize? (2) If 3/5 of those scans have finished (and there is therefore no point to synchronize with those scans), how do we find the two that are still moving? I think storing one entry per relation being scanned, regardless of the number of backends, nicely avoids both of those questions. > > What if I just make an LRU that occupies a fixed size? Any reads or > > updates can start at the MRU item and work backward, and any evictions > > can happen at the LRU item. > > > > Does a hash table really save us cycles if we keep the list small, say, > > 100 items? Looking at all 100 items would only be required perhaps on a > > scan startup. I don't have a good sense of scale here, is following 50 > > pointers while holding a lock a significant source of contention? > > Hmm. If you only need to scan through it when you start the scan, I > suppose it would be ok. The idea is that the list would never grow longer than the total number of tables that are larger than sync_seqscan_threshold, and would have some maximum (to fit into fixed-size shared memory). The list would be a bi-directional LRU list (MRU at head, LRU at tail). When evicting a relation from the list due to space exhaustion, it would delete the tail of the list. When a new scan starts and it's already in the list, it would most likely just need to read a few of the hot elements at the head of the list before it found the relation in question. When it's not in the list, the scan may need to read all the way through to see that it's not there. > > 100 is of course arbitrary, and that could grow or shrink automatically > > at runtime. > > Except that it can't shrink, because we don't have any convenient moment > to remove an entry from the list, other than dropping the least recently > used entry out of the list when inserting a new one. And since it's in > shared mem, the allocated size is fixed at boot up, so we can't grow it > either. > It can shrink when a new scan looks through the list and passes by entries that don't need to be in the list. I don't want to over-complicate things, so if you think having a hash table is a better, simpler, or more readable design, I'll go with that instead of the list-only approach. 
The only reason I suggested the list-only approach was to see if the hash table was really gaining anything for us. In practice, it will be quite rare that more than 10 different big relations are being scanned at once, and I don't know how much of a gain a hash table is when there are (in all but exceptional cases) only a few entries. Regards, Jeff Davis
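A sketch of the list-only alternative being discussed: a small fixed-size structure in shared memory kept in MRU order, scanned linearly, with the tail entry evicted when it's full. For brevity this uses an array with move-to-front rather than the bi-directional linked list described above; names and the size cap are invented, and the caller is assumed to hold whatever lock protects the structure.

#include "postgres.h"
#include "storage/block.h"
#include "storage/relfilenode.h"

#define SYNC_SCAN_NELEM 100             /* arbitrary cap from the discussion */

typedef struct ss_lru_item
{
    RelFileNode relfilenode;            /* which relation is being scanned */
    BlockNumber location;               /* block most recently reported */
} ss_lru_item;

typedef struct ss_lru
{
    int         nitems;                 /* number of valid entries */
    ss_lru_item items[SYNC_SCAN_NELEM]; /* items[0] is the MRU entry */
} ss_lru;

static ss_lru *scan_list;               /* points into shared memory */

/*
 * Find the entry for rnode and move it to the front; if it isn't there,
 * take a free slot or evict the least recently used entry at the tail.
 */
static ss_lru_item *
ss_lookup_or_add(RelFileNode rnode)
{
    ss_lru_item item;
    int         i;

    for (i = 0; i < scan_list->nitems; i++)
    {
        if (RelFileNodeEquals(scan_list->items[i].relfilenode, rnode))
            break;
    }

    if (i == scan_list->nitems)
    {
        if (scan_list->nitems < SYNC_SCAN_NELEM)
            scan_list->nitems++;
        i = scan_list->nitems - 1;      /* free slot, or the LRU tail */
        scan_list->items[i].relfilenode = rnode;
        scan_list->items[i].location = 0;
    }

    /* Move the entry to the head, shifting the hotter entries down by one. */
    item = scan_list->items[i];
    memmove(&scan_list->items[1], &scan_list->items[0],
            i * sizeof(ss_lru_item));
    scan_list->items[0] = item;

    return &scan_list->items[0];
}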
On Thu, 2007-05-03 at 08:01 +0100, Simon Riggs wrote: > On Wed, 2007-05-02 at 23:59 +0100, Heikki Linnakangas wrote: > > > Umm, you naturally have just entry per relation, but we were talking > > about how many entries the table needs to hold.. You're patch had a > > hard-coded value of 1000 which is quite arbitrary. > > We need to think of the interaction with partitioning here. People will > ask whether we would recommend that individual partitions of a large > table should be larger/smaller than a particular size, to allow these > optimizations to kick in. > > My thinking is that database designers would attempt to set partition > size larger than the sync scan limit, whatever it is. That means: > - they wouldn't want the limit to vary when cache increases, so we *do* > need a GUC to control the limit. My suggestion now would be > large_scan_threshold, since it effects both caching and synch scans. > - so there will be lots of partitions, so a hardcoded limit of 1000 > would not be sufficient. A new GUC, or a link to an existing one, is > probably required. > That's a very good point. I don't know how much we can do to fix it now though, because that has interactions with the planner too: the planner "should" choose to UNION ALL the relations in an order dependent on other concurrent queries. I think this will require more thought. To address the idea of scaling to more relations being concurrently scanned, I could use Heikki's recommendation of a dynamic hash table. Regards, Jeff Davis
Jeff Davis wrote: > What I was trying to say before is that, in my design, it keeps track of > relations being scanned, not scans happening. So 5 backends scanning the > same table would result in one entry that consists of the most recently > read block by any of those backends for that relation. We're talking past each other... I understand that the data structure keeps track of relations being scanned, with one entry per such relation. I think that's a very good design, and I'm not suggesting to change that. But what's the right size for that? We don't know how many large tables there are in the database, so we can't use that. A hard-coded constant maybe? What would it be? I figured we could use NBackends, because there can't realistically be more than one large seq scan per backend going on at any point in time. That's an upper bound; in any real world application there's far less than that. > Part of the reason I did this is because, with a Sync Scan, it seems > like there may be possible reasons to do multiple seq scans concurrently > in the same backend. This is completely contrived, but consider: > > SELECT * FROM bigtable UNION ALL SELECT * FROM bigtable; > > I'm not trying to argue for such a plan to exist right now, but if there > is a possibility that we'd want to create such plans in the future, I > think it's better to track by relation rather than by backend. Those two seqscans would be executed one after the other, not in parallel, so you would still have just one seq scan at any given point in time. > (1) When starting a new scan on relation R, with which backend do we > choose to synchronize? > (2) If 3/5 of those scans have finished (and there is therefore no point > to synchronize with those scans), how do we find the two that are still > moving? > > I think storing one entry per relation being scanned, regardless of the > number of backends, nicely avoids both of those questions. Yeah, agreed. That's a good design. >>> What if I just make an LRU that occupies a fixed size? Any reads or >>> updates can start at the MRU item and work backward, and any evictions >>> can happen at the LRU item. >>> >>> Does a hash table really save us cycles if we keep the list small, say, >>> 100 items? Looking at all 100 items would only be required perhaps on a >>> scan startup. I don't have a good sense of scale here, is following 50 >>> pointers while holding a lock a significant source of contention? >> Hmm. If you only need to scan through it when you start the scan, I >> suppose it would be ok. > > The idea is that the list would never grow longer than the total number > of tables that are larger than sync_seqscan_threshold, and would have > some maximum (to fit into fixed-size shared memory). The list would be a > bi-directional LRU list (MRU at head, LRU at tail). When evicting a > relation from the list due to space exhaustion, it would delete the tail > of the list. When a new scan starts and it's already in the list, it > would most likely just need to read a few of the hot elements at the > head of the list before it found the relation in question. When it's not > in the list, the scan may need to read all the way through to see that > it's not there. Yeah, that works well when there are just a few hot elements in the list. I'm not sure how large the list would need to grow before scanning it would become a performance bottleneck, but I suppose a list of 5-10 elements would be perfectly fine. But is that enough? 
>>> 100 is of course arbitrary, and that could grow or shrink automatically >>> at runtime. >> Except that it can't shrink, because we don't have any convenient moment >> to remove an entry from the list, other than dropping the least recently >> used entry out of the list when inserting a new one. And since it's in >> shared mem, the allocated size is fixed at boot up, so we can't grow it >> either. >> > > It can shrink when a new scan looks through the list and passes by > entries that don't need to be in the list. How do you identify an entry that doesn't need to be there? > I don't want to over-complicate things, so if you think having a hash > table is a better, simpler, or more readable design, I'll go with that > instead of the list-only approach. The only reason I suggested the list- > only approach was to see if the hash table was really gaining anything > for us. In practice, it will be quite rare that more than 10 different > big relations are being scanned at once, and I don't know how much of a > gain a hash table is when there are (in all but exceptional cases) only > a few entries. How about implementing the list-only approach first, since you're going to need the LRU list anyway, and we can consider adding the hash table later? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Thu, 2007-05-03 at 19:27 +0100, Heikki Linnakangas wrote: > I understand that the data structure keeps track of relations being > scanned, with one entry per such relation. I think that's very good > design, and I'm not suggesting to change that. > > But what's the right size for that? We don't know how many large tables > there is in the database, so we can't use that. A hard-coded constant > maybe? What would it be? I figured we could use NBackends, because there > can't realistically be more than one large seq scan per backend going on > at any point in time. That's an upper bound, in any real world > application there's far less than that. Ok, I understand you now. We'll choose the maximum size of the table/list based on max_connections at startup. > > Part of the reason I did this is because, with a Sync Scan, it seems > > like there may be possible reasons to do multiple seq scans concurrently > > in the same backend. This is completely contrived, but consider: > > > > SELECT * FROM bigtable UNION ALL SELECT * FROM bigtable; > > > > I'm not trying to argue for such a plan to exist right now, but if there > > is a possibility that we'd want to create such plans in the future, I > > think it's better to track by relation rather than by backend. > > Those two seqscans would be executed one after each other, not in > parallel, so you would still have just one seq scan at any given point > in time. Right, I was just throwing out a hypothetical situation (if highly contrived) in which it may help to do multiple scans at once. I think this can be safely ignored for now though, since the planner would never generate such a plan, and I can't think of a reasonable use case. > >>> 100 is of course arbitrary, and that could grow or shrink automatically > >>> at runtime. > >> Except that it can't shrink, because we don't have any convenient moment > >> to remove an entry from the list, other than dropping the least recently > >> used entry out of the list when inserting a new one. And since it's in > >> shared mem, the allocated size is fixed at boot up, so we can't grow it > >> either. > >> > > > > It can shrink when a new scan looks through the list and passes by > > entries that don't need to be in the list. > > How do you identify an entry that doesn't need to be there? > Some type of clock-like algorithm where a counter is decremented when an element is passed over and incremented when it's chosen might work. It's probably easier to just use a hash table though. > > I don't want to over-complicate things, so if you think having a hash > > table is a better, simpler, or more readable design, I'll go with that > > instead of the list-only approach. The only reason I suggested the list- > > only approach was to see if the hash table was really gaining anything > > for us. In practice, it will be quite rare that more than 10 different > > big relations are being scanned at once, and I don't know how much of a > > gain a hash table is when there are (in all but exceptional cases) only > > a few entries. > > How about implementing the list-only approach first, since you're going > to need the LRU list anyway, and we can consider adding the hash > table later? > Sounds good. Will do. Regards, Jeff Davis