Thread: Sequential scans
Hi,

I'm starting to review the "synchronized scans" and "scan-resistant buffer cache" patches. The patches have complex interactions so I'm taking a holistic approach.

There are four outstanding issues with the sync scans in particular:

1. The simplistic hash approach. While it's nice to not have a lock, I'm worried about collisions. If you had a collision every now and then, it wouldn't be that bad, but because the hash value is computed from the oid, a collision would be persistent. If you create a database and happen to have two frequently seqscanned tables that collide, the only way to get rid of the collision is to drop and recreate a table. Granted, that'd probably be very rare in practice, but when it happens it would be next to impossible to figure out what's going on.

Let's use a normal hash table instead, and use a lock to protect it. If we only update it every 10 pages or so, the overhead should be negligible. To further reduce contention, we could modify ReadBuffer to let the caller know if the read resulted in a physical read or not, and only update the entry when a page is physically read in. That way all the synchronized scanners wouldn't be updating the same value, just the one performing the I/O. And while we're at it, let's use the full relfilenode instead of just the table oid in the hash.

2. Under what circumstances does the patch help and when does it hurt? I think the patch is safe in that it should never be any worse than what we have now. But when does it help? That needs to be looked at together with the other patch. I need to dig through the archives for the performance test results you posted earlier and try to understand them.

There are six distinct scenarios I've come up with thus far that need to be looked at:

A. A seq scan on a small table
B. A seq scan on a table that's 110% the size of shared_buffers, but smaller than RAM
C. A seq scan on a table that's 110% the size of RAM
D. A seq scan on a huge table
E. Two simultaneous seq scans on a large table starting at the same time
F. Two simultaneous seq scans on a large table, the 2nd one starting when the 1st one is halfway through

Also, does it change things if you have a bulk update instead of a read-only query? How about bitmap heap scans and large index scans? And vacuums? And the above scenarios need to be considered both alone and in the presence of other OLTP-style workloads.

I realize that we can't have everything, and as long as we get some significant benefit in some scenarios, and don't hurt others, the patch is worthwhile. But let's try to cover as much as we reasonably can.

One random idea I had to cover B & C without having the offset variable: start scanning *backwards* from the page that's in the shared hash table, until you hit a page that's not in the buffer cache. Then you continue scanning forwards from the page you started from. This needs more thought, but I think we can come up with a pretty simple solution that covers the most common cases.

3. By having different backends doing the reads, are we destroying OS readahead as Tom suggested? I remember you performed some tests on that, and it was a problem on some systems but not on others. This needs some thought; there may be some simple way to address that.

4. It fails regression tests. You get an assertion failure on the portal test. I believe that changing the direction of a scan isn't handled properly; it's probably pretty easy to fix.

Jeff, could you please fix 1 and 4?
I'll give 2 and 3 some more thought, and take a closer look at the scan-resistant scans patch. Any comments and ideas are welcome, of course.. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
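As a rough sketch of what item 1 above describes — a shared-memory hash table keyed on the relation's RelFileNode, updated only every few pages under an LWLock — here is one way it could look. All names here (ss_scan_location, ScanLocationTable, SyncScanLock, SyncScanShmemInit) are invented for illustration and are not part of the patch; SyncScanLock would be a new lock, and the table sizing is the open question discussed below. ShmemInitHash and tag_hash are the existing dynahash facilities.

#include "postgres.h"
#include "storage/block.h"
#include "storage/lwlock.h"
#include "storage/relfilenode.h"
#include "storage/shmem.h"
#include "utils/hsearch.h"

/* One entry per relation currently being scanned sequentially. */
typedef struct ss_scan_location
{
    RelFileNode relfilenode;    /* hash key: tablespace, database, relation */
    BlockNumber location;       /* block number most recently reported */
} ss_scan_location;

static HTAB *ScanLocationTable;     /* lives in shared memory */

/* Called from shared-memory initialization at postmaster startup. */
void
SyncScanShmemInit(void)
{
    HASHCTL     info;

    MemSet(&info, 0, sizeof(info));
    info.keysize = sizeof(RelFileNode);
    info.entrysize = sizeof(ss_scan_location);
    info.hash = tag_hash;

    ScanLocationTable = ShmemInitHash("Sync Scan Locations",
                                      64, 64,   /* sizing is an open question */
                                      &info,
                                      HASH_ELEM | HASH_FUNCTION);
}

/*
 * Report the current position of a scan, but only every N pages so that
 * the lock traffic stays low.  SyncScanLock would be a new LWLock.
 */
#define SYNC_SCAN_REPORT_INTERVAL 10

void
ss_report_location(RelFileNode rnode, BlockNumber location)
{
    ss_scan_location *entry;
    bool        found;

    if ((location % SYNC_SCAN_REPORT_INTERVAL) != 0)
        return;

    LWLockAcquire(SyncScanLock, LW_EXCLUSIVE);
    entry = (ss_scan_location *) hash_search(ScanLocationTable,
                                             &rnode, HASH_ENTER, &found);
    entry->location = location;
    LWLockRelease(SyncScanLock);
}

A backend starting a new seq scan would do the reverse: a HASH_FIND under a shared lock to pick its starting block.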
On Wed, 2007-05-02 at 14:26 +0100, Heikki Linnakangas wrote: > Hi, > > I'm starting to review the "synchronized scans" and "scan-resistant > buffer cache" patches. The patches have complex interactions so I'm > taking a holistic approach. > > There's four outstanding issues with the sync scans in particular: > > 1. The simplistic hash approach. While it's nice to not have a lock, I'm > worried of collisions. If you had a collision every now and then, it > wouldn't be that bad, but because the hash value is computed from the > oid, a collision would be persistent. If you create a database and > happen to have two frequently seqscanned tables that collide, the only > way to get rid of the collision is to drop and recreate a table. > Granted, that'd probably be very rare in practice, but when it happens > it would be next to impossible to figure out what's going on. > > Let's use a normal hash table instead, and use a lock to protect it. If > we only update it every 10 pages or so, the overhead should be > negligible. To further reduce contention, we could modify ReadBuffer to > let the caller know if the read resulted in a physical read or not, and > only update the entry when a page is physically read in. That way all > the synchronized scanners wouldn't be updating the same value, just the > one performing the I/O. And while we're at it, let's use the full > relfilenode instead of just the table oid in the hash. What should be the maximum size of this hash table? Is there already-existing hash table code that I should use to be consistent with the rest of the code? I'm still trying to understand the effect of using the full relfilenode. Do you mean using the entire relation _segment_ as the key? That doesn't make sense to me. Or do you just mean using the relfilenode (without the segment) as the key? > 3. By having different backends doing the reads, are we destroying OS > readahead as Tom suggested? I remember you performed some tests on that, > and it was a problem on some systems but not on others. This needs some > thought, there may be some simple way to address that. Linux with the CFQ I/O scheduler performs very poorly and inconsistently with concurrent sequential scans regardless of whether the scans are synchronized or not. I suspect the reason for this is that CFQ is designed to care more about the process issuing the request than any other factor. Every other I/O system performed either ideally (no interference between scans) or had some interference but still much better than current behavior. Of course, my tests are limited and there are many possible combinations of I/O systems that I did not try. > 4. It fails regression tests. You get an assertion failure on the portal > test. I believe that changing the direction of a scan isn't handled > properly; it's probably pretty easy to fix. > I will examine the code more carefully. As a first guess, is it possible that the test is failing because of the non-deterministic order in which tuples are returned? > Jeff, could you please fix 1 and 4? I'll give 2 and 3 some more thought, > and take a closer look at the scan-resistant scans patch. Any comments > and ideas are welcome, of course.. > Yes. I'll also try to address the other issues in your email. Thanks for your comments. Regards, Jeff Davis
"Heikki Linnakangas" <heikki@enterprisedb.com> writes: > Let's use a normal hash table instead, and use a lock to protect it. If we only > update it every 10 pages or so, the overhead should be negligible. To further > reduce contention, we could modify ReadBuffer to let the caller know if the > read resulted in a physical read or not, and only update the entry when a page > is physically read in. That way all the synchronized scanners wouldn't be > updating the same value, just the one performing the I/O. And while we're at > it, let's use the full relfilenode instead of just the table oid in the hash. It's probably fine to just do that. But if we find it's a performance bottleneck we could probably still manage to avoid the lock except when actually inserting a new hash element. If you just store in the hash an index into an array stored in global memory then you could get away without a lock on the element in the array. It starts to get to be a fair amount of code when you think about how you would reuse elements of the array. That's why I suggest only looking at this if down the road we find that it's a bottleneck. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Gregory Stark wrote: > "Heikki Linnakangas" <heikki@enterprisedb.com> writes: > >> Let's use a normal hash table instead, and use a lock to protect it. If we only >> update it every 10 pages or so, the overhead should be negligible. To further >> reduce contention, we could modify ReadBuffer to let the caller know if the >> read resulted in a physical read or not, and only update the entry when a page >> is physically read in. That way all the synchronized scanners wouldn't be >> updating the same value, just the one performing the I/O. And while we're at >> it, let's use the full relfilenode instead of just the table oid in the hash. > > It's probably fine to just do that. But if we find it's a performance > bottleneck we could probably still manage to avoid the lock except when > actually inserting a new hash element. If you just store in the hash an index > into an array stored in global memory then you could get away without a lock > on the element in the array. > > It starts to get to be a fair amount of code when you think about how you > would reuse elements of the array. That's why I suggest only looking at this > if down the road we find that it's a bottleneck. Another trick you could do is to acquire the lock conditionally when updating it. But I doubt it's a problem anyhow, if we put some sane lower limit in there so that it's not used at all for small tables. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
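To illustrate the conditional-acquire idea: if the lock happens to be busy, the scan simply skips this update rather than stalling, and tries again a few pages later. This reuses the hypothetical ScanLocationTable and SyncScanLock names from the sketch above; LWLockConditionalAcquire is the existing primitive that returns false instead of blocking.

/*
 * Variant of the reporting step using a conditional acquire: if some other
 * backend holds the lock, skip this update instead of waiting for it.
 */
void
ss_report_location_conditional(RelFileNode rnode, BlockNumber location)
{
    if (LWLockConditionalAcquire(SyncScanLock, LW_EXCLUSIVE))
    {
        ss_scan_location *entry;
        bool        found;

        entry = (ss_scan_location *) hash_search(ScanLocationTable,
                                                 &rnode, HASH_ENTER, &found);
        entry->location = location;
        LWLockRelease(SyncScanLock);
    }
    /* else: we'll get another chance to report in a few pages anyway */
}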
Jeff Davis wrote: > On Wed, 2007-05-02 at 14:26 +0100, Heikki Linnakangas wrote: >> Let's use a normal hash table instead, and use a lock to protect it. If >> we only update it every 10 pages or so, the overhead should be >> negligible. To further reduce contention, we could modify ReadBuffer to >> let the caller know if the read resulted in a physical read or not, and >> only update the entry when a page is physically read in. That way all >> the synchronized scanners wouldn't be updating the same value, just the >> one performing the I/O. And while we're at it, let's use the full >> relfilenode instead of just the table oid in the hash. > > What should be the maximum size of this hash table? Good question. And also, how do you remove entries from it? I guess the size should somehow be related to number of backends. Each backend will realistically be doing just 1 or max 2 seq scan at a time. It also depends on the number of large tables in the databases, but we don't have that information easily available. How about using just NBackends? That should be plenty, but wasting a few hundred bytes of memory won't hurt anyone. I think you're going to need an LRU list and counter of used entries in addition to the hash table, and when all entries are in use, remove the least recently used one. The thing to keep an eye on is that it doesn't add too much overhead or lock contention in the typical case when there's no concurrent scans. For the locking, use a LWLock. > Is there already- existing hash table code that I should use to be consistent with the > rest of the code? Yes, see utils/hash/dynahash.c, and ShmemInitHash (in storage/ipc/shmem.c) since it's in shared memory. There are plenty of examples that use hash tables, see for example storage/freespace/freespace.c. > I'm still trying to understand the effect of using the full relfilenode. > Do you mean using the entire relation _segment_ as the key? That doesn't > make sense to me. Or do you just mean using the relfilenode (without the > segment) as the key? No, not the segment. RelFileNode consists of tablespace oid, database oid and relation oid. You can find it in scan->rs_rd->rd_node. The segmentation works at a lower level. > Linux with CFQ I/O scheduler performs very poorly and inconsistently > with concurrent sequential scans regardless of whether the scans are > synchronized or not. I suspect the reason for this is that CFQ is > designed to care more about the process issuing the request than any > other factor. > > Every other I/O system performed either ideally (no interference between > scans) or had some interference but still much better than current > behavior. Hmm. Should we care then? CFQ is the default on Linux, and an average sysadmin is unlikely to change it. What we could do quite easily is: - when ReadBuffer is called, let the caller know if the read did physical I/O. - when the previous ReadBuffer didn't result in physical I/O, assume that we're not the pack leader. If the next buffer isn't already in cache, wait a few milliseconds before initiating the read, giving the pack leader a chance to do it instead. Needs testing, of course.. >> 4. It fails regression tests. You get an assertion failure on the portal >> test. I believe that changing the direction of a scan isn't handled >> properly; it's probably pretty easy to fix. >> > > I will examine the code more carefully. As a first guess, is it possible > that test is failing because of the non-deterministic order in which > tuples are returned? 
No, it's an assertion failure, not just different output than expected. But it's probably quite simple to fix.. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
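A rough sketch of the "pack leader" throttling described above. It assumes a hypothetical flag telling us whether our previous read required physical I/O — which is exactly the ReadBuffer change being proposed, so the flag is simply passed in here — and the 2 ms delay is an arbitrary placeholder.

#include "postgres.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

static Buffer
ss_read_next_block(Relation rel, BlockNumber blocknum,
                   bool prev_read_was_physical)
{
    /*
     * If our previous read was satisfied from the buffer cache, another
     * backend is presumably leading the pack and doing the I/O for us.
     * Give it a small head start before issuing this read ourselves.
     * (A real implementation would also want to check whether the block
     * is already cached before deciding to sleep.)
     */
    if (!prev_read_was_physical)
        pg_usleep(2000L);       /* ~2 ms; pure guesswork, needs testing */

    return ReadBuffer(rel, blocknum);
}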
On Wed, 2007-05-02 at 20:02 +0100, Heikki Linnakangas wrote: > Gregory Stark wrote: > > "Heikki Linnakangas" <heikki@enterprisedb.com> writes: > > > >> Let's use a normal hash table instead, and use a lock to protect it. If we only > >> update it every 10 pages or so, the overhead should be negligible. To further > >> reduce contention, we could modify ReadBuffer to let the caller know if the > >> read resulted in a physical read or not, and only update the entry when a page > >> is physically read in. That way all the synchronized scanners wouldn't be > >> updating the same value, just the one performing the I/O. And while we're at > >> it, let's use the full relfilenode instead of just the table oid in the hash. > > > > It's probably fine to just do that. But if we find it's a performance > > bottleneck we could probably still manage to avoid the lock except when > > actually inserting a new hash element. If you just store in the hash an index > > into an array stored in global memory then you could get away without a lock > > on the element in the array. > > > > It starts to get to be a fair amount of code when you think about how you > > would reuse elements of the array. That's why I suggest only looking at this > > if down the road we find that it's a bottleneck. > > Another trick you could do is to use acquire the lock conditionally when > updating it. But I doubt it's a problem anyhow, if we put some sane > lower limit in there so that it's not used at all for small tables. > The more sophisticated the data structure, the less able we are to avoid locking, correct? For instance, if we have an LRU list it might be tricky or impossible to avoid locking even on just reads. Regards, Jeff Davis
Jeff Davis wrote: > On Wed, 2007-05-02 at 20:02 +0100, Heikki Linnakangas wrote: >> Gregory Stark wrote: >>> "Heikki Linnakangas" <heikki@enterprisedb.com> writes: >>> >>>> Let's use a normal hash table instead, and use a lock to protect it. If we only >>>> update it every 10 pages or so, the overhead should be negligible. To further >>>> reduce contention, we could modify ReadBuffer to let the caller know if the >>>> read resulted in a physical read or not, and only update the entry when a page >>>> is physically read in. That way all the synchronized scanners wouldn't be >>>> updating the same value, just the one performing the I/O. And while we're at >>>> it, let's use the full relfilenode instead of just the table oid in the hash. >>> It's probably fine to just do that. But if we find it's a performance >>> bottleneck we could probably still manage to avoid the lock except when >>> actually inserting a new hash element. If you just store in the hash an index >>> into an array stored in global memory then you could get away without a lock >>> on the element in the array. >>> >>> It starts to get to be a fair amount of code when you think about how you >>> would reuse elements of the array. That's why I suggest only looking at this >>> if down the road we find that it's a bottleneck. >> Another trick you could do is to use acquire the lock conditionally when >> updating it. But I doubt it's a problem anyhow, if we put some sane >> lower limit in there so that it's not used at all for small tables. >> > > The more sophisticated the data structure the less able we are to avoid > locking, correct? For instance, if we have an LRU list it might be > tricky or impossible to avoid locking even on just reads. Agreed. I'm not concerned about reads, though. You only need to read from the structure once when you start a scan. It's the updates that cause most of the traffic. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, 2007-05-02 at 20:58 +0100, Heikki Linnakangas wrote: > Jeff Davis wrote: > > What should be the maximum size of this hash table? > > Good question. And also, how do you remove entries from it? > > I guess the size should somehow be related to number of backends. Each > backend will realistically be doing just 1 or max 2 seq scan at a time. > It also depends on the number of large tables in the databases, but we > don't have that information easily available. How about using just > NBackends? That should be plenty, but wasting a few hundred bytes of > memory won't hurt anyone. One entry per relation, not per backend, is my current design. > I think you're going to need an LRU list and counter of used entries in > addition to the hash table, and when all entries are in use, remove the > least recently used one. > > The thing to keep an eye on is that it doesn't add too much overhead or > lock contention in the typical case when there's no concurrent scans. > > For the locking, use a LWLock. > Ok. What would be the potential lock contention in the case of no concurrent scans? Also, is it easy to determine the space used by a dynahash with N entries? I haven't looked at the dynahash code yet, so perhaps this will be obvious. > No, not the segment. RelFileNode consists of tablespace oid, database > oid and relation oid. You can find it in scan->rs_rd->rd_node. The > segmentation works at a lower level. Ok, will do. > Hmm. Should we care then? CFQ is the default on Linux, and an average > sysadmin is unlikely to change it. > Keep in mind that concurrent sequential scans with CFQ *already* perform very poorly. I think that alone is an interesting fact that's somewhat independent of Sync Scans. > - when ReadBuffer is called, let the caller know if the read did > physical I/O. > - when the previous ReadBuffer didn't result in physical I/O, assume > that we're not the pack leader. If the next buffer isn't already in > cache, wait a few milliseconds before initiating the read, giving the > pack leader a chance to do it instead. > > Needs testing, of course.. > An interesting idea. I like that one the most out of the ideas for maintaining a "pack leader". That's very similar to what the Linux anticipatory scheduler does for us. > >> 4. It fails regression tests. You get an assertion failure on the portal > >> test. I believe that changing the direction of a scan isn't handled > >> properly; it's probably pretty easy to fix. > >> > > > > I will examine the code more carefully. As a first guess, is it possible > > that test is failing because of the non-deterministic order in which > > tuples are returned? > > No, it's an assertion failure, not just different output than expected. > But it's probably quite simple to fix.. > Ok, I'll find and correct it then. Regards, Jeff Davis
Jeff Davis wrote: > On Wed, 2007-05-02 at 20:58 +0100, Heikki Linnakangas wrote: >> Jeff Davis wrote: >>> What should be the maximum size of this hash table? >> Good question. And also, how do you remove entries from it? >> >> I guess the size should somehow be related to number of backends. Each >> backend will realistically be doing just 1 or max 2 seq scan at a time. >> It also depends on the number of large tables in the databases, but we >> don't have that information easily available. How about using just >> NBackends? That should be plenty, but wasting a few hundred bytes of >> memory won't hurt anyone. > > One entry per relation, not per backend, is my current design. Umm, you naturally have just one entry per relation, but we were talking about how many entries the table needs to hold.. Your patch had a hard-coded value of 1000 which is quite arbitrary. >> I think you're going to need an LRU list and counter of used entries in >> addition to the hash table, and when all entries are in use, remove the >> least recently used one. >> >> The thing to keep an eye on is that it doesn't add too much overhead or >> lock contention in the typical case when there's no concurrent scans. >> >> For the locking, use a LWLock. > > Ok. What would be the potential lock contention in the case of no > concurrent scans? If you only have one seq scan in the system, there's no contention. If you have more, they will all need to acquire the lock to update the page number in the hash table, regardless of the table they're scanning, so there's potential for contention. To put things into perspective, though, any time you need to evict a buffer from the buffer cache to read in a new buffer, you need to acquire the BufFreelistLock. If we only update the page number say every 10 pages, the overhead and lock contention of that lock would be roughly 1/10th of that arising from BufFreelistLock. And we can make it a conditional acquire, in which case the backends won't need to slow down their scans, but the updates of the entries in the hash table will get less frequent instead. We could also take just a shared lock to update the counter, and rely on the atomicity of the write like your patch did. You'd only need to take an exclusive lock to move entries in the LRU list or to add or remove an entry. But let's keep it simple at first, and do some testing.. > Also, is it easy to determine the space used by a dynahash with N > entries? I haven't looked at the dynahash code yet, so perhaps this will > be obvious. Yes, see hash_estimate_size. BTW: Thanks for the patch! -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
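For the sizing question, a sketch of how hash_estimate_size() could be used to reserve the shared memory, capping the table at one entry per backend as discussed. MaxBackends is the existing global; SyncScanShmemSize and ss_scan_location are the hypothetical names from the earlier sketch.

#include "postgres.h"
#include "miscadmin.h"          /* MaxBackends */
#include "utils/hsearch.h"

Size
SyncScanShmemSize(void)
{
    /* One entry per backend is a generous upper bound on concurrent scans. */
    return hash_estimate_size(MaxBackends, sizeof(ss_scan_location));
}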
On Wed, 2007-05-02 at 23:59 +0100, Heikki Linnakangas wrote: > Jeff Davis wrote: > > On Wed, 2007-05-02 at 20:58 +0100, Heikki Linnakangas wrote: > >> Jeff Davis wrote: > >>> What should be the maximum size of this hash table? > >> Good question. And also, how do you remove entries from it? > >> > >> I guess the size should somehow be related to number of backends. Each > >> backend will realistically be doing just 1 or max 2 seq scan at a time. > >> It also depends on the number of large tables in the databases, but we > >> don't have that information easily available. How about using just > >> NBackends? That should be plenty, but wasting a few hundred bytes of > >> memory won't hurt anyone. > > > > One entry per relation, not per backend, is my current design. > > Umm, you naturally have just entry per relation, but we were talking > about how many entries the table needs to hold.. You're patch had a > hard-coded value of 1000 which is quite arbitrary. I think I'm missing something from your statement "I guess the size should somehow be related to number of backends." If there is only one entry per relation, why do more backends require a larger hash table? > If you only have one seq scan in the system, there's no contention. If > you have more, they will all need to acquire the lock to update the page > number in the hash table, regardless of the table their scanning, so > there's potential for contention. > > To put things into perspective, though, any time you need to evict a > buffer from the buffer cache to read in a new buffer, you need to > acquire the BufFreelistLock. If we only update the page number say every > 10 pages, the overhead and lock contention of that lock would be roughly > 1/10th of that arising from BufFreelistLock. And we can make it a > conditional acquire, in which case the backends won't need to slow down > their scans, but the updates of the entries in the hash table will get > less frequent instead. We could also take just a shared lock to update > the counter, and rely on the atomicity of the write like your patch did. > You'd only need to take an exclusive lock to move entries in the LRU > list or to add or remove an entry. > If I just step back for a second to consider a simpler design: We will most likely have no more than a few relations larger than the minimum threshold for Sync Scanning in any database being scanned concurrently. What if I just make an LRU that occupies a fixed size? Any reads or updates can start at the MRU item and work backward, and any evictions can happen at the LRU item. Does a hash table really save us cycles if we keep the list small, say, 100 items? Looking at all 100 items would only be required perhaps on a scan startup. I don't have a good sense of scale here; is following 50 pointers while holding a lock a significant source of contention? 100 is of course arbitrary, and that could grow or shrink automatically at runtime. > Yes, see hash_estimate_size. > Ok. Regards, Jeff Davis
On Wed, 2007-05-02 at 23:59 +0100, Heikki Linnakangas wrote: > Umm, you naturally have just entry per relation, but we were talking > about how many entries the table needs to hold.. You're patch had a > hard-coded value of 1000 which is quite arbitrary. We need to think of the interaction with partitioning here. People will ask whether we would recommend that individual partitions of a large table should be larger/smaller than a particular size, to allow these optimizations to kick in. My thinking is that database designers would attempt to set partition size larger than the sync scan limit, whatever it is. That means: - they wouldn't want the limit to vary when cache increases, so we *do* need a GUC to control the limit. My suggestion now would be large_scan_threshold, since it affects both caching and synch scans. - so there will be lots of partitions, so a hardcoded limit of 1000 would not be sufficient. A new GUC, or a link to an existing one, is probably required. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
Jeff Davis wrote: > On Wed, 2007-05-02 at 23:59 +0100, Heikki Linnakangas wrote: >> Jeff Davis wrote: >>> On Wed, 2007-05-02 at 20:58 +0100, Heikki Linnakangas wrote: >>>> Jeff Davis wrote: >>>>> What should be the maximum size of this hash table? >>>> Good question. And also, how do you remove entries from it? >>>> >>>> I guess the size should somehow be related to number of backends. Each >>>> backend will realistically be doing just 1 or max 2 seq scan at a time. >>>> It also depends on the number of large tables in the databases, but we >>>> don't have that information easily available. How about using just >>>> NBackends? That should be plenty, but wasting a few hundred bytes of >>>> memory won't hurt anyone. >>> One entry per relation, not per backend, is my current design. >> Umm, you naturally have just entry per relation, but we were talking >> about how many entries the table needs to hold.. You're patch had a >> hard-coded value of 1000 which is quite arbitrary. > > I think I'm missing something from your statement "I guess the size > should somehow be related to number of backends." If there is only one > entry per relation, why do more backends require a larger hash table? The hash table keeps track of ongoing seq scans. That's presumably related to the number of backends; I can't imagine a plan where a backend is executing more than one seq scan at a time, so that gives us an upper bound of NBackends simultaneous seq scans in the system. Sure, if you only have say 5 tables that are large enough that we try to keep track of them, there can't be more than 5 entries in the table. But we don't know how many such tables there are at startup time. A hard-coded constant in the range of 10-100 might be just fine in practice. > If I just step back for a second to consider a simpler design: > > We will most likely have no more than a few relations larger than the > minimum threshold for Sync Scanning in any database being scanned > concurrently. Note that the list would be shared across all databases in the cluster. Your point is still valid, though. > What if I just make an LRU that occupies a fixed size? Any reads or > updates can start at the MRU item and work backward, and any evictions > can happen at the LRU item. > > Does a hash table really save us cycles if we keep the list small, say, > 100 items? Looking at all 100 items would only be required perhaps on a > scan startup. I don't have a good sense of scale here, is following 50 > pointers while holding a lock a significant source of contention? Hmm. If you only need to scan through it when you start the scan, I suppose it would be ok. > 100 is of course arbitrary, and that could grow or shrink automatically > at runtime. Except that it can't shrink, because we don't have any convenient moment to remove an entry from the list, other than dropping the least recently used entry out of the list when inserting a new one. And since it's in shared mem, the allocated size is fixed at boot up, so we can't grow it either. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Simon Riggs wrote: > We need to think of the interaction with partitioning here. People will > ask whether we would recommend that individual partitions of a large > table should be larger/smaller than a particular size, to allow these > optimizations to kick in. > > My thinking is that database designers would attempt to set partition > size larger than the sync scan limit, whatever it is. That means: > - they wouldn't want the limit to vary when cache increases, so we *do* > need a GUC to control the limit. My suggestion now would be > large_scan_threshold, since it effects both caching and synch scans. They wouldn't? If you add more memory to your box, so that a table that didn't fit in memory before does now fit, surely you want to switch your strategy from "don't pollute the cache because it won't fit anyway" to "let's keep it in cache, so the next scan won't do I/O". The basic problem with partitions is that if you have a query like "SELECT * FROM partitioned_table", so that you seq scan multiple partitions, the size of each partition alone could be below the threshold, whatever that is, but since you're scanning them all the net result is the same as scanning one large table above the threshold. The same could happen in any plan that does multiple seq scans. It could even happen at the application level if the application submits multiple statements like "SELECT * FROM table1", "SELECT * FROM table2" one after the other. One way to address that would be to manage the recycle buffer ring size dynamically. When a backend gets a cache miss, the ring would shrink, and when it gets cache hits, it would grow. That way, if you do a seq scan on a table that fits in cache repeatedly, that table would get more buffers from the cache on each iteration until it's completely in cache. But if the table or tables being scanned are too big to fit in cache, the ring would stay small and not pollute the cache much. I'm not going to implement that for now; I'm seeing some scary negative feedback behavior with that, and it'd need a lot of testing anyway. I'm thinking of just using shared_buffers as the limit. One could argue for effective_cache_size as well. > - so there will be lots of partitions, so a hardcoded limit of 1000 > would not be sufficient. A new GUC, or a link to an existing one, is > probably required. No matter how many partitions you have, each backend could still be scanning only one of them at a time. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
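For concreteness, a sketch of the dynamic ring-sizing idea described above, which is explicitly not being implemented; every name and constant here is made up. The ring grows on buffer-cache hits and shrinks on misses, within fixed bounds, so a table that fits in cache gradually becomes fully cached while scans of larger tables keep using a small ring.

#include "postgres.h"
#include "storage/bufmgr.h"     /* NBuffers */

#define RING_SIZE_MIN   32                  /* arbitrary */
#define RING_SIZE_MAX   (NBuffers / 8)      /* arbitrary */

static int  seqscan_ring_size = RING_SIZE_MIN;

/* Call after each page is read during a large sequential scan. */
static void
ring_adjust(bool was_cache_hit)
{
    if (was_cache_hit)
        seqscan_ring_size = Min(seqscan_ring_size + 1, RING_SIZE_MAX);
    else
        seqscan_ring_size = Max(seqscan_ring_size - 1, RING_SIZE_MIN);
}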
On Thu, 2007-05-03 at 09:25 +0100, Heikki Linnakangas wrote: > The hash table keeps track of ongoing seq scans. That's presumably > related to number of backends; I can't imagine a plan where a backend is > executing more than one seq scan at a time, so that gives us an upper > bound of NBackends simultaneous seq scans in the system. > What I was trying to say before is that, in my design, it keeps track of relations being scanned, not scans happening. So 5 backends scanning the same table would result in one entry that consists of the most recently read block by any of those backends for that relation. Part of the reason I did this is because, with a Sync Scan, it seems like there may be possible reasons to do multiple seq scans concurrently in the same backend. This is completely contrived, but consider: SELECT * FROM bigtable UNION ALL SELECT * FROM bigtable; I'm not trying to argue for such a plan to exist right now, but if there is a possibility that we'd want to create such plans in the future, I think it's better to track by relation rather than by backend. There is really no reason that I can see to track 5 positions of 5 different backends scanning the same relation. It introduces a lot of new questions, such as: (1) When starting a new scan on relation R, with which backend do we choose to synchronize? (2) If 3/5 of those scans have finished (and there is therefore no point to synchronize with those scans), how do we find the two that are still moving? I think storing one entry per relation being scanned, regardless of the number of backends, nicely avoids both of those questions. > > What if I just make an LRU that occupies a fixed size? Any reads or > > updates can start at the MRU item and work backward, and any evictions > > can happen at the LRU item. > > > > Does a hash table really save us cycles if we keep the list small, say, > > 100 items? Looking at all 100 items would only be required perhaps on a > > scan startup. I don't have a good sense of scale here, is following 50 > > pointers while holding a lock a significant source of contention? > > Hmm. If you only need to scan through it when you start the scan, I > suppose it would be ok. The idea is that the list would never grow longer than the total number of tables that are larger than sync_seqscan_threshold, and would have some maximum (to fit into fixed-size shared memory). The list would be a bi-directional LRU list (MRU at head, LRU at tail). When evicting a relation from the list due to space exhaustion, it would delete the tail of the list. When a new scan starts and it's already in the list, it would most likely just need to read a few of the hot elements at the head of the list before it found the relation in question. When it's not in the list, the scan may need to read all the way through to see that it's not there. > > 100 is of course arbitrary, and that could grow or shrink automatically > > at runtime. > > Except that it can't shrink, because we don't have any convenient moment > to remove an entry from the list, other than dropping the least recently > used entry out of the list when inserting a new one. And since it's in > shared mem, the allocated size is fixed at boot up, so we can't grow it > either. > It can shrink when a new scan looks through the list and passes by entries that don't need to be in the list. I don't want to over-complicate things, so if you think having a hash table is a better, simpler, or more readable design, I'll go with that instead of the list-only approach. 
The only reason I suggested the list-only approach was to see if the hash table was really gaining anything for us. In practice, it will be quite rare that more than 10 different big relations are being scanned at once, and I don't know how much of a gain a hash table is when there are (in all but exceptional cases) only a few entries. Regards, Jeff Davis
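A sketch of the list-only alternative being discussed: a small fixed-size structure in shared memory kept in MRU order, scanned linearly, with the tail entry evicted when it's full. For brevity this uses an array with move-to-front rather than the bi-directional linked list described above; names and the size cap are invented, and the caller is assumed to hold whatever lock protects the structure.

#include "postgres.h"
#include "storage/block.h"
#include "storage/relfilenode.h"

#define SYNC_SCAN_NELEM 100             /* arbitrary cap from the discussion */

typedef struct ss_lru_item
{
    RelFileNode relfilenode;            /* which relation is being scanned */
    BlockNumber location;               /* block most recently reported */
} ss_lru_item;

typedef struct ss_lru
{
    int         nitems;                 /* number of valid entries */
    ss_lru_item items[SYNC_SCAN_NELEM]; /* items[0] is the MRU entry */
} ss_lru;

static ss_lru *scan_list;               /* points into shared memory */

/*
 * Find the entry for rnode and move it to the front; if it isn't there,
 * take a free slot or evict the least recently used entry at the tail.
 */
static ss_lru_item *
ss_lookup_or_add(RelFileNode rnode)
{
    ss_lru_item item;
    int         i;

    for (i = 0; i < scan_list->nitems; i++)
    {
        if (RelFileNodeEquals(scan_list->items[i].relfilenode, rnode))
            break;
    }

    if (i == scan_list->nitems)
    {
        if (scan_list->nitems < SYNC_SCAN_NELEM)
            scan_list->nitems++;
        i = scan_list->nitems - 1;      /* free slot, or the LRU tail */
        scan_list->items[i].relfilenode = rnode;
        scan_list->items[i].location = 0;
    }

    /* Move the entry to the head, shifting the hotter entries down by one. */
    item = scan_list->items[i];
    memmove(&scan_list->items[1], &scan_list->items[0],
            i * sizeof(ss_lru_item));
    scan_list->items[0] = item;

    return &scan_list->items[0];
}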
On Thu, 2007-05-03 at 08:01 +0100, Simon Riggs wrote: > On Wed, 2007-05-02 at 23:59 +0100, Heikki Linnakangas wrote: > > > Umm, you naturally have just entry per relation, but we were talking > > about how many entries the table needs to hold.. You're patch had a > > hard-coded value of 1000 which is quite arbitrary. > > We need to think of the interaction with partitioning here. People will > ask whether we would recommend that individual partitions of a large > table should be larger/smaller than a particular size, to allow these > optimizations to kick in. > > My thinking is that database designers would attempt to set partition > size larger than the sync scan limit, whatever it is. That means: > - they wouldn't want the limit to vary when cache increases, so we *do* > need a GUC to control the limit. My suggestion now would be > large_scan_threshold, since it effects both caching and synch scans. > - so there will be lots of partitions, so a hardcoded limit of 1000 > would not be sufficient. A new GUC, or a link to an existing one, is > probably required. > That's a very good point. I don't know how much we can do to fix it now though, because that has interactions with the planner too: the planner "should" choose to UNION ALL the relations in an order dependent on other concurrent queries. I think this will require more thought. To address the idea of scaling to more relations being concurrently scanned, I could use Heikki's recommendation of a dynamic hash table. Regards, Jeff Davis
Jeff Davis wrote: > What I was trying to say before is that, in my design, it keeps track of > relations being scanned, not scans happening. So 5 backends scanning the > same table would result in one entry that consists of the most recently > read block by any of those backends for that relation. We're talking past each other... I understand that the data structure keeps track of relations being scanned, with one entry per such relation. I think that's a very good design, and I'm not suggesting to change that. But what's the right size for that? We don't know how many large tables there are in the database, so we can't use that. A hard-coded constant maybe? What would it be? I figured we could use NBackends, because there can't realistically be more than one large seq scan per backend going on at any point in time. That's an upper bound; in any real world application there's far less than that. > Part of the reason I did this is because, with a Sync Scan, it seems > like there may be possible reasons to do multiple seq scans concurrently > in the same backend. This is completely contrived, but consider: > > SELECT * FROM bigtable UNION ALL SELECT * FROM bigtable; > > I'm not trying to argue for such a plan to exist right now, but if there > is a possibility that we'd want to create such plans in the future, I > think it's better to track by relation rather than by backend. Those two seqscans would be executed one after the other, not in parallel, so you would still have just one seq scan at any given point in time. > (1) When starting a new scan on relation R, with which backend do we > choose to synchronize? > (2) If 3/5 of those scans have finished (and there is therefore no point > to synchronize with those scans), how do we find the two that are still > moving? > > I think storing one entry per relation being scanned, regardless of the > number of backends, nicely avoids both of those questions. Yeah, agreed. That's a good design. >>> What if I just make an LRU that occupies a fixed size? Any reads or >>> updates can start at the MRU item and work backward, and any evictions >>> can happen at the LRU item. >>> >>> Does a hash table really save us cycles if we keep the list small, say, >>> 100 items? Looking at all 100 items would only be required perhaps on a >>> scan startup. I don't have a good sense of scale here, is following 50 >>> pointers while holding a lock a significant source of contention? >> Hmm. If you only need to scan through it when you start the scan, I >> suppose it would be ok. > > The idea is that the list would never grow longer than the total number > of tables that are larger than sync_seqscan_threshold, and would have > some maximum (to fit into fixed-size shared memory). The list would be a > bi-directional LRU list (MRU at head, LRU at tail). When evicting a > relation from the list due to space exhaustion, it would delete the tail > of the list. When a new scan starts and it's already in the list, it > would most likely just need to read a few of the hot elements at the > head of the list before it found the relation in question. When it's not > in the list, the scan may need to read all the way through to see that > it's not there. Yeah, that works well when there are just a few hot elements in the list. I'm not sure how large the list would need to grow before scanning it would become a performance bottleneck, but I suppose a list of 5-10 elements would be perfectly fine. But is that enough? 
>>> 100 is of course arbitrary, and that could grow or shrink automatically >>> at runtime. >> Except that it can't shrink, because we don't have any convenient moment >> to remove an entry from the list, other than dropping the least recently >> used entry out of the list when inserting a new one. And since it's in >> shared mem, the allocated size is fixed at boot up, so we can't grow it >> either. >> > > It can shrink when a new scan looks through the list and passes by > entries that don't need to be in the list. How do you identify an entry that doesn't need to be there? > I don't want to over-complicate things, so if you think having a hash > table is a better, simpler, or more readable design, I'll go with that > instead of the list-only approach. The only reason I suggested the list- > only approach was to see if the hash table was really gaining anything > for us. In practice, it will be quite rare that more than 10 different > big relations are being scanned at once, and I don't know how much of a > gain a hash table is when there are (in all but exceptional cases) only > a few entries. How about implementing the list-only approach first, since you're going to need the LRU list anyway, and we can consider adding the hash table later? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Thu, 2007-05-03 at 19:27 +0100, Heikki Linnakangas wrote: > I understand that the data structure keeps track of relations being > scanned, with one entry per such relation. I think that's very good > design, and I'm not suggesting to change that. > > But what's the right size for that? We don't know how many large tables > there is in the database, so we can't use that. A hard-coded constant > maybe? What would it be? I figured we could use NBackends, because there > can't realistically be more than one large seq scan per backend going on > at any point in time. That's an upper bound, in any real world > application there's far less than that. Ok, I understand you now. We'll choose the maximum size of the table/list based on max_connections at startup. > > Part of the reason I did this is because, with a Sync Scan, it seems > > like there may be possible reasons to do multiple seq scans concurrently > > in the same backend. This is completely contrived, but consider: > > > > SELECT * FROM bigtable UNION ALL SELECT * FROM bigtable; > > > > I'm not trying to argue for such a plan to exist right now, but if there > > is a possibility that we'd want to create such plans in the future, I > > think it's better to track by relation rather than by backend. > > Those two seqscans would be executed one after each other, not in > parallel, so you would still have just one seq scan at any given point > in time. Right, I was just throwing out a hypothetical situation (if highly contrived) in which it may help to do multiple scans at once. I think this can be safely ignored for now though, since the planner would never generate such a plan, and I can't think of a reasonable use case. > >>> 100 is of course arbitrary, and that could grow or shrink automatically > >>> at runtime. > >> Except that it can't shrink, because we don't have any convenient moment > >> to remove an entry from the list, other than dropping the least recently > >> used entry out of the list when inserting a new one. And since it's in > >> shared mem, the allocated size is fixed at boot up, so we can't grow it > >> either. > >> > > > > It can shrink when a new scan looks through the list and passes by > > entries that don't need to be in the list. > > How do you identify an entry that doesn't need to be there? > Some type of clock-like algorithm where a counter is decremented when an element is passed over and incremented when it's chosen might work. It's probably easier to just use a hash table though. > > I don't want to over-complicate things, so if you think having a hash > > table is a better, simpler, or more readable design, I'll go with that > > instead of the list-only approach. The only reason I suggested the list- > > only approach was to see if the hash table was really gaining anything > > for us. In practice, it will be quite rare that more than 10 different > > big relations are being scanned at once, and I don't know how much of a > > gain a hash table is when there are (in all but exceptional cases) only > > a few entries. > > How about implementing the list-only approach first, since you're going > to need the LRU list anyway, and we can consider adding the hash > table later? > Sounds good. Will do. Regards, Jeff Davis