Thread: Erroneous cost estimation for nested loop join
Hi guys,

I’ve been using Postgres for research at a university, and we found some interesting queries for which the Postgres optimizer chose the wrong query plan. I will describe one example query below. The query is just a join of two tables, but the optimizer overestimates the cost of the nested loop join plan and selects a hash join plan whose execution time is, in the worst case, 3.24 times slower than the nested loop join. We investigated the query and found that the cost model of current Postgres makes a wrong estimation for some nested loop join plans, so we are posting this mail to share the problem.

## Test environment:

These are the machines we used for the examination.

・Processor: Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz (8 cores) * 2 sockets
・Memory: 64GB (128MB was allocated for shared_buffers)
・Storage: RAID6 (22D+2P) with 24 NL-SAS 10krpm HDDs
  - 19TB in total
  - Maximum sequential read throughput: 2.5GB/s (measured by dd command)
・CentOS 5.8 (linux 2.6.18-371.1.2.el5)
・Postgres 9.4.1
  - cost parameter calibration: random_page_cost = 92.89

We used two tables from the TPC-H(*) benchmark dataset and modified the join cardinality between orders and lineitem. It is 1:4 originally, but we modified it to 1:10000, so there are 10000 lineitem records per orders record. There are 10,000 records in orders, 100,000,000 records in lineitem, and these tables occupy 1,261 MB on disk.

(*) TPC-H is a benchmark specification that models a decision support system (DSS).

## Query

In order to examine the cost estimation error of the nested loop join, we measured the cost and the execution time of the following two queries with various selectivities (the value of KEY). We set the parameters “enable_hashjoin” and “enable_seqscan” to false so that hash join is not selected.

Query1

select sum(l_quantity) from lineitem where l_orderkey <= [KEY];

This is an example of the execution plan (KEY=100):

Aggregate  (cost=362288.74..362288.75 rows=1 width=5)
  ->  Index Scan using l_orderkey_idx2 on lineitem2  (cost=0.57..359788.73 rows=1000000 width=5)
        Index Cond: (l_orderkey <= 100)

Query2

select sum(o_totalprice), sum(l_quantity) from orders inner join lineitem on (o_orderkey = l_orderkey) where o_orderkey <= [KEY];

This is an example of the execution plan (KEY=100):

Aggregate  (cost=2296937.99..2296938.00 rows=1 width=14)
  ->  Nested Loop  (cost=0.85..2291887.98 rows=1010000 width=14)
        ->  Index Scan using orders_pkey2 on orders2  (cost=0.29..188.84 rows=101 width=13)
              Index Cond: (o_orderkey <= 100)
        ->  Index Scan using l_orderkey_idx2 on lineitem2  (cost=0.57..22578.67 rows=11142 width=9)
              Index Cond: (l_orderkey = orders2.o_orderkey)

## Results

[Figure1.png]

We show the execution times of the two queries in Figure 1. In this case, the selectivity is proportional to KEY and becomes 100% when KEY=10000. As we can see from this graph, the execution times of the two queries are almost the same, because the majority of the cost is the lineitem scan, and the I/O workloads of these two scans are nearly identical.

Let me explain in more detail why these two workloads are nearly identical. Lineitem has a clustered index on the l_orderkey column, and in Query1 this index is used to scan lineitem, so the I/O workload is a sequential scan of lineitem. In Query2, this index is also used to scan the inner table (i.e. lineitem) repeatedly in the nested loop join, and the outer table (i.e. orders) is scanned via the index on o_orderkey, so the key of the inner scan is ordered by l_orderkey. That is to say, lineitem is substantially scanned from the head (Figure 2).
[Figure2.png]

So the I/O cost difference between Query1 and Query2 is the cost to scan the outer table (i.e. orders), and it is very small because the outer table scan cost is very cheap in comparison with the inner table scan cost.

But when we check the cost with the explain command, Postgres overestimates the cost of Query2. For example, the cost of Query2 is 5 times that of Query1 at KEY=10,000.

[Figure3.png]

We can see that the optimizer estimates very different cost values for two queries which have almost the same execution time. So the optimizer will select hash join even when nested loop is faster, and in the worst case the execution time of the hash join will be about 3.24 times the execution time of the nested loop for Query2.

We guessed the cause of this error would be in the cost model of Postgres, investigated the source code of the optimizer, and found the cause of this problem. It is in the index cost estimation process: when scanning the inner table, if the loop count is greater than 1, its I/O cost is counted as random access. In the case of Query2, in one loop (i.e. one inner table scan) much data is read sequentially via the clustered index, so it seems wrong that the optimizer treats this I/O workload as random access.

Specifically, in function “cost_index” in “costsize.c”, if loop_count > 1, the cost estimate expression is as follows:

min_IO_cost = (pages_fetched * spc_random_page_cost) / loop_count;

We modified this code so that the first page is counted as a random access and the rest as sequential accesses:

min_IO_cost = spc_random_page_cost / (double)loop_count;
if (pages_fetched > 1)
{
    min_IO_cost += (pages_fetched - 1) * spc_seq_page_cost / loop_count;
}

Then we get the modified cost line in Figure 4 (gray line).

[Figure4.png]

As this result shows, the difference between Query1 and Query2 (Fixed) is smaller than the original one, so the cost estimation seems to be improved.

The query space is infinite, so this problem is merely one example. To fix only this problem, it is enough to modify the code as above. But we think the optimizer may still have other problems, so we would like to hear your opinions about problems like this, because you have deep knowledge of Postgres. We find the cost model of the optimizer very interesting for research, so we want to expand the discussion beyond this particular problem and contribute to the Postgres community.

Thanks.
Ryoji
Attachment
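For concreteness, a small self-contained sketch (not the actual costsize.c code; the page counts and cost settings are illustrative, not taken from the thread) comparing the current and modified min_IO_cost expressions for a repeated inner index scan:

#include <stdio.h>

int main(void)
{
    /* Illustrative figures only: each inner scan touches 10000 heap pages,
     * the outer side drives 100 inner scans, default page cost settings. */
    double pages_fetched        = 10000.0;
    double loop_count           = 100.0;
    double spc_random_page_cost = 4.0;
    double spc_seq_page_cost    = 1.0;
    double min_IO_cost;

    /* Current estimate: every page of the inner scan is charged as random. */
    min_IO_cost = (pages_fetched * spc_random_page_cost) / loop_count;
    printf("current  min_IO_cost = %.2f\n", min_IO_cost);

    /* Modified estimate: first page random, remaining pages sequential. */
    min_IO_cost = spc_random_page_cost / loop_count;
    if (pages_fetched > 1)
        min_IO_cost += (pages_fetched - 1) * spc_seq_page_cost / loop_count;
    printf("modified min_IO_cost = %.2f\n", min_IO_cost);

    return 0;
}

With these made-up numbers the charge per inner scan drops from 400 to roughly 100, which is the direction of the change Ryoji describes.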
On Mon, Nov 9, 2015 at 11:08 AM, <kawamichi@tkl.iis.u-tokyo.ac.jp> wrote:
> - cost parameter calibration: random_page_cost = 92.89
This demands some explanation and raises the question of the value of the seq_page_cost parameter -- I don't see anything about it in your mail.
--
Alex
On 09/11/15 23:08, kawamichi@tkl.iis.u-tokyo.ac.jp wrote:
> Hi guys,
>
> I’ve been using Postgres for research at a university,

Great!

[...]

> ・Postgres 9.4.1
[...]

More knowledgeable people are sure to reply in more detail! However, they would probably appreciate it if you could run with 9.4.5 (the latest released version). Running it with the beta of 9.5 would be a bonus!

Note that I don't know enough to say for sure that later versions would make any difference in this case, but at least using later versions would kill lots of Red Herrings! :-)

Cheers,
Gavin
On 9 November 2015 at 10:08, <kawamichi@tkl.iis.u-tokyo.ac.jp> wrote:
> We guessed the cause of this error would be in the cost model of Postgres,
> investigated the source code of the optimizer, and found the cause of this
> problem. It is in the index cost estimation process: when scanning the
> inner table, if the loop count is greater than 1, its I/O cost is counted
> as random access. In the case of Query2, in one loop (i.e. one inner table
> scan) much data is read sequentially via the clustered index, so it seems
> wrong that the optimizer treats this I/O workload as random access.
We don't have a clustered index in Postgres. We do store correlation stats for the index, which is used in some places to reduce cost.
Could you look some more at this and see if a model change that uses the correlation can improve this?
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
kawamichi@tkl.iis.u-tokyo.ac.jp writes:
> - cost parameter calibration: random_page_cost = 92.89

TBH, you lost me there already.  I know of no hardware on which that would be a sane depiction of reality, so I think you've probably overfitted the model to some particular case it was already inaccurate on.  Any results you're getting using this setting will likely fall into the category of "garbage in, garbage out".

What led you to choose that number?

			regards, tom lane
<tgl@sss.pgh.pa.us> wrote:
>
> TBH, you lost me there already.  I know of no hardware on which that would
> be a sane depiction of reality, so I think you've probably overfitted the
> model to some particular case it was already inaccurate on.  Any results
> you're getting using this setting will likely fall into the category of
> "garbage in, garbage out".
>
> What led you to choose that number?

Uh, it seems that the scale of our machine caused confusion. It has a rather large disk array, and I think random_page_cost = 92.89 is reasonable (the reason is described later).

This problem can also be observed in a normal environment like random_page_cost = 4. I checked the estimated cost of query (1) and query (2) with random_page_cost = 4, and there was a noticeable cost estimation difference between them (see the attached graph). As I explained in the first mail, the execution times of query (1) and query (2) are expected to be almost the same. If so, there is something wrong with the cost estimation logic, and the problem is not specific to a particular environment. The result I showed was just one example that confirms the problem.

Next, I'd like to explain our environment and the reason for the random_page_cost value. Our machine has a RAID6 array with 24 NL-SAS HDDs, and its raw I/O performance is:

  Sequential access: 2.5GB/s
  Random access (8KB block): 1.6MB/s (= 200 IOPS * 8KB)

For Postgres we have filesystems and some computation such as tuple processing or aggregation, so sequential access becomes slower. Here are basic performance values of I/O intensive queries like a simple scan on a large table:

  Sequential access: 200-300MB/s(*)
  Random access: 200 IOPS * 8KB = 1.6MB/s

  (*) depends on aggregation or any other computation

So sequential access is 125-188 times faster than random access. We assume an OLAP workload, so the cache hit ratio is not so high. That means the performance gap between sequential I/O and random I/O would not be mitigated by cache effects. Therefore we think random_page_cost should be around 100 in our environment.

random_page_cost = 92.89 is the result of calibration with multiple TPC-H queries, which cover everything from a simple scan on a single table to complex joins of multiple tables, and this value gives good cost estimation for TPC-H queries.

Thanks.
Ryoji
Attachment
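As a quick check of the arithmetic in the mail above, a tiny standalone sketch (not part of Postgres; the constants are the measured throughputs quoted above) deriving the implied random/sequential cost ratio:

#include <stdio.h>

int main(void)
{
    /* Measured throughput of I/O-intensive queries on this machine (see above). */
    double seq_mb_s_low  = 200.0;  /* sequential scan, low end  */
    double seq_mb_s_high = 300.0;  /* sequential scan, high end */
    double rand_mb_s     = 1.6;    /* 200 IOPS * 8KB blocks     */

    /* Implied ratio of random to sequential page cost, assuming a negligible cache hit ratio. */
    printf("random_page_cost should be roughly %.0f to %.0f\n",
           seq_mb_s_low / rand_mb_s, seq_mb_s_high / rand_mb_s);
    return 0;
}

This prints a range of about 125 to 188, which is consistent with the calibrated value of 92.89 being within an order of magnitude of the raw hardware ratio.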
<oleksandr.shulgin@zalando.de> wrote:
>>
>> - cost parameter calibration: random_page_cost = 92.89
>>
>
> This demands some explanation and raises the question of the value of the
> seq_page_cost parameter -- I don't see anything about it in your mail.

seq_page_cost was set to 1.0 (default), and I explained the reason for the random_page_cost value in my reply to Tom. Please see it.

Thanks
Ryoji
<GavinFlower@archidevsys.co.nz> wrote:
>
> More knowledgeable people are sure to reply in more detail!
>
> However, they would probably appreciate it if you could run with 9.4.5
> (the latest released version). Running it with the beta of 9.5 would be
> a bonus!
>
> Note that I don't know enough to say for sure that later versions would
> make any difference in this case, but at least using later versions
> would kill lots of Red Herrings! :-)
>

The difference between minor versions does not matter in this case. I confirmed that the cost calculation logic related to this problem was not changed between Postgres 9.4.1 and 9.4.5.

I heard there are some performance improvements in 9.5. They might change something regarding this problem, so I will try these queries on 9.5!

Thanks
Ryoji
<simon@2ndQuadrant.com> wrote:
>>
>> We guessed the cause of this error would be in the cost model of Postgres,
>> investigated the source code of the optimizer, and found the cause of this
>> problem. It is in the index cost estimation process: when scanning the
>> inner table, if the loop count is greater than 1, its I/O cost is counted
>> as random access. In the case of Query2, in one loop (i.e. one inner table
>> scan) much data is read sequentially via the clustered index, so it seems
>> wrong that the optimizer treats this I/O workload as random access.
>>
>
> We don't have a clustered index in Postgres. We do store correlation stats
> for the index, which is used in some places to reduce cost.

Yes, Postgres does not have a clustered index, as you pointed out. By “clustered index” I meant an index whose correlation is 1.0. In this case, the index is the primary key (records are physically ordered by it) and the index was created just after the whole data was loaded. We’ve been assuming an OLAP workload for our experiments, so I think correlation = 1.0 is the basic case for our experiments.

> Could you look some more at this and see if a model change that uses the
> correlation can improve this?

I'm not sure I understand the question, so let me clarify. Did you mean that if I read the optimizer code more, I will find that the correlation is already used to improve the cost estimation? Of course I read it, and I know that the correlation is used to interpolate between the min cost and the max cost. (The min cost is the best-case cost, i.e. correlation is 1.0, and the max cost is the worst-case cost, i.e. correlation is 0.) But in both cases, the I/O cost is counted as random access when scanning the inner table.

I think the I/O cost should be counted as sequential access when the correlation is 1.0, so I tried to modify the code as in the previous mail. But this modification is just one example of a solution. I'm not so familiar with the optimizer code yet, so I'm wondering whether this is the right way or not.

Thank you for your comment.
Ryoji
On Mon, Nov 9, 2015 at 2:42 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 9 November 2015 at 10:08, <kawamichi@tkl.iis.u-tokyo.ac.jp> wrote:
>>
>> We guessed the cause of this error would be in the cost model of Postgres,
>> investigated the source code of the optimizer, and found the cause of this
>> problem. It is in the index cost estimation process: when scanning the
>> inner table, if the loop count is greater than 1, its I/O cost is counted
>> as random access. In the case of Query2, in one loop (i.e. one inner table
>> scan) much data is read sequentially via the clustered index, so it seems
>> wrong that the optimizer treats this I/O workload as random access.
>
> We don't have a clustered index in Postgres. We do store correlation stats
> for the index, which is used in some places to reduce cost.
>
> Could you look some more at this and see if a model change that uses the
> correlation can improve this?

That is already happening.  min_IO_cost is set on the assumption of perfect correlation, max_IO_cost is set on the assumption of no correlation, and then later in the code it interpolates between these two based on the observed correlation:

run_cost += max_IO_cost + csquared * (min_IO_cost - max_IO_cost);

But the min_IO_cost is very pessimistic in this particular case. So I think that their proposed change is already exactly what you are asking them to try.

Cheers,

Jeff
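A minimal sketch of how that interpolation behaves (this is not the real cost_index(); the correlation values and the two I/O costs are illustrative inputs):

#include <stdio.h>

/* Interpolate between the best-case (perfect correlation) and worst-case
 * (no correlation) I/O cost, as in the line Jeff quotes above. */
static double
interpolate_io_cost(double min_IO_cost, double max_IO_cost, double indexCorrelation)
{
    double csquared = indexCorrelation * indexCorrelation;

    return max_IO_cost + csquared * (min_IO_cost - max_IO_cost);
}

int main(void)
{
    /* Illustrative numbers: best case 100, worst case 400. */
    printf("correlation 0.0 -> %.1f\n", interpolate_io_cost(100.0, 400.0, 0.0));
    printf("correlation 0.5 -> %.1f\n", interpolate_io_cost(100.0, 400.0, 0.5));
    printf("correlation 1.0 -> %.1f\n", interpolate_io_cost(100.0, 400.0, 1.0));
    return 0;
}

Even at correlation 1.0 the estimate can only collapse to min_IO_cost, which is why the pessimism of min_IO_cost itself is the problem in this case.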
On Mon, Nov 9, 2015 at 6:37 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> kawamichi@tkl.iis.u-tokyo.ac.jp writes:
>> - cost parameter calibration: random_page_cost = 92.89
>
> TBH, you lost me there already.  I know of no hardware on which that would
> be a sane depiction of reality, so I think you've probably overfitted the
> model to some particular case it was already inaccurate on.

I can easily get a ratio of random to sequential of 50, and my RAID is nothing special.  I don't see why a high-end RAID couldn't justify 100 or more, as long as they know the cache hit rate is very low.  (The default value of 4 seems to bake in the notion that 90% of even random IO is going to be satisfied from the cache, which might be a good estimate if you have frequently used smallish lookup tables that always get joined to the RAM-busters, but some people aren't going to have that type of database query as their main load.)

With the current code, a single scan out of several can get estimated to have a higher cost than a free-standing single scan (loop_count > 1), and I don't see how that can ever make sense.  Right now it can only benefit from assumed cache hits (driven by effective_cache_size) via Mackert and Lohman.  I think that, at least, it should get to claim the greater of either the Mackert and Lohman benefit between inner scans, or the benefit of converting some random IO to sequential within each separate inner scan.

And really, I don't see why it should not get both benefits.  If the pages are still in cache between inner scans, that's great.  But if, the one time they do need to be read in from disk, they are read in mostly sequentially, why is that benefit not also fully justified?  I don't see where the worry about "double-counting the cache effects" comes from.  The effective_cache_size and IO read-ahead can both apply and both give benefits simultaneously and cumulatively.

Cheers,

Jeff
On Mon, Nov 16, 2015 at 6:50 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Mon, Nov 9, 2015 at 6:37 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> kawamichi@tkl.iis.u-tokyo.ac.jp writes:
>>> - cost parameter calibration: random_page_cost = 92.89
>>
>> TBH, you lost me there already.  I know of no hardware on which that would
>> be a sane depiction of reality, so I think you've probably overfitted the
>> model to some particular case it was already inaccurate on.
>
> I can easily get a ratio of random to sequential of 50, and my RAID is
> nothing special.  I don't see why a high-end RAID couldn't justify 100
> or more, as long as they know the cache hit rate is very low.  (The default
> value of 4 seems to bake in the notion that 90% of even random IO is
> going to be satisfied from the cache, which might be a good estimate
> if you have frequently used smallish lookup tables that always get
> joined to the RAM-busters, but some people aren't going to have that
> type of database query as their main load.)

I agree.  What I've been thinking about is:

- If we're sequential scanning a small table, let's say less than 1/4 of shared_buffers, which is the point where synchronized scans kick in, then assume the data is coming from shared_buffers.
- If we're scanning a medium-sized table, let's say less than effective_cache_size, then assume the data is coming from the OS cache.  Maybe this is the same cost as the previous case, or maybe it's slightly more.
- Otherwise, assume that the first effective_cache_size pages are coming from cache and the rest has to be read from disk.  This is perhaps unrealistic, but we don't want the cost curve to be discontinuous.

The effect of this would be that sequential scanning a small table would look cheaper per page than sequential scanning a large table, which is a good idea, because it's likely to be true.  Similar adaptations could be made for index scans, index-only scans, bitmap index scans, and bitmap heap scans.  The default value of random_page_cost would go up, but there would be a new cached_page_cost GUC that would apply in some cases where seq_page_cost and/or random_page_cost apply today.

Previous attempts to improve the modeling of cache effects have faltered because we don't know what will be cached at execution time, and even if we did it can change very quickly and we don't want plan instability.  But it seems to me that guessing based on the size of the relation is pretty reasonable - essentially we'd be trying to refine a model which says that every page is equally likely to be cached, and that shouldn't be too high a bar to clear.

A problem with this sort of thing, of course, is that it's really hard to test a proposed change broadly enough to be certain how it will play out in the real world.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
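A rough, self-contained sketch of the tiered assumption described above. All names, thresholds, and settings here are hypothetical illustrations; cached_page_cost stands in for the GUC Robert proposes, not an existing Postgres setting:

#include <stdio.h>

/* Purely illustrative sketch of the tiered cache assumption; none of these
 * functions or structs exist in Postgres. */
typedef struct
{
    double  shared_buffers_pages;     /* size of shared_buffers, in pages   */
    double  effective_cache_size;     /* assumed total cache, in pages      */
    double  cached_page_cost;         /* cost to read a cached page         */
    double  seq_page_cost;            /* cost to read a page from disk, seq */
} cost_params;

static double
seqscan_io_cost(double rel_pages, const cost_params *p)
{
    if (rel_pages < p->shared_buffers_pages / 4)
    {
        /* Small table: assume it is entirely in shared_buffers. */
        return rel_pages * p->cached_page_cost;
    }
    else if (rel_pages < p->effective_cache_size)
    {
        /* Medium table: assume it is in the OS cache. */
        return rel_pages * p->cached_page_cost;
    }
    else
    {
        /* Large table: first effective_cache_size pages cached, rest from disk. */
        return p->effective_cache_size * p->cached_page_cost +
               (rel_pages - p->effective_cache_size) * p->seq_page_cost;
    }
}

int main(void)
{
    cost_params p = {16384.0, 524288.0, 0.1, 1.0};  /* made-up settings */

    printf("small table:  %.3f per page\n", seqscan_io_cost(1000.0, &p) / 1000.0);
    printf("large table:  %.3f per page\n", seqscan_io_cost(1e7, &p) / 1e7);
    return 0;
}

With these made-up settings a small table costs 0.1 per page while a very large one approaches the full seq_page_cost, which is the "cheaper per page for small tables" effect the proposal aims for, while keeping the cost curve continuous.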
Robert Haas <robertmhaas@gmail.com> wrote:
>
> - If we're sequential scanning a small table, let's say less than 1/4
> of shared_buffers, which is the point where synchronized scans kick
> in, then assume the data is coming from shared_buffers.
> - If we're scanning a medium-sized table, let's say less than
> effective_cache_size, then assume the data is coming from the OS
> cache.  Maybe this is the same cost as the previous case, or maybe
> it's slightly more.
> - Otherwise, assume that the first effective_cache_size pages are
> coming from cache and the rest has to be read from disk.  This is
> perhaps unrealistic, but we don't want the cost curve to be
> discontinuous.

I think this improvement is quite reasonable, and I expect it will be merged into the current optimizer code.

> A problem with this sort of thing, of course, is that it's really hard
> to test a proposed change broadly enough to be certain how it will
> play out in the real world.

That's the problem we're really interested in and trying to tackle.

For example, with extensive experiments, I'm really sure my modification of the cost model is effective for our environment, but I can't tell whether it is also effective, or unfortunately harmful, in other environments in general.

And I think that, in the Postgres community, there must be (maybe buried) knowledge on how to judge the effectiveness of cost model modifications, because someone should have considered something like that for each commit. I'm interested in it, and hopefully would like to contribute to finding a better way to improve the optimizer through cost model refinement.

Thanks.
Ryoji.
On Mon, Nov 30, 2015 at 04:29:43PM +0900, KAWAMICHI Ryoji wrote:
> Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > - If we're sequential scanning a small table, let's say less than 1/4
> > of shared_buffers, which is the point where synchronized scans kick
> > in, then assume the data is coming from shared_buffers.
> > - If we're scanning a medium-sized table, let's say less than
> > effective_cache_size, then assume the data is coming from the OS
> > cache.  Maybe this is the same cost as the previous case, or maybe
> > it's slightly more.
> > - Otherwise, assume that the first effective_cache_size pages are
> > coming from cache and the rest has to be read from disk.  This is
> > perhaps unrealistic, but we don't want the cost curve to be
> > discontinuous.
>
> I think this improvement is quite reasonable, and I expect it will be merged
> into the current optimizer code.
>
> > A problem with this sort of thing, of course, is that it's really hard
> > to test a proposed change broadly enough to be certain how it will
> > play out in the real world.
>
> That's the problem we're really interested in and trying to tackle.
>
> For example, with extensive experiments, I'm really sure my modification of
> the cost model is effective for our environment, but I can't tell whether it
> is also effective, or unfortunately harmful, in other environments in general.
>
> And I think that, in the Postgres community, there must be (maybe buried)
> knowledge on how to judge the effectiveness of cost model modifications,
> because someone should have considered something like that for each commit.
> I'm interested in it, and hopefully would like to contribute to finding
> a better way to improve the optimizer through cost model refinement.

No one mentioned the random page docs so I will quote them here:

	http://www.postgresql.org/docs/9.5/static/runtime-config-query.html#RUNTIME-CONFIG-QUERY-CONSTANTS

	Random access to mechanical disk storage is normally much more expensive
	than four times sequential access.  However, a lower default is used
	(4.0) because the majority of random accesses to disk, such as indexed
	reads, are assumed to be in cache.  The default value can be thought of
	as modeling random access as 40 times slower than sequential, while
	expecting 90% of random reads to be cached.

	If you believe a 90% cache rate is an incorrect assumption for your
	workload, you can increase random_page_cost to better reflect the true
	cost of random storage reads.  Correspondingly, if your data is likely to
	be completely in cache, such as when the database is smaller than the
	total server memory, decreasing random_page_cost can be appropriate.
	Storage that has a low random read cost relative to sequential, e.g.
	solid-state drives, might also be better modeled with a lower value for
	random_page_cost.

What we don't have is a way to know how much is in the cache, not only at planning time, but at execution time.  (Those times are often different for prepared queries.)  I think that is the crux of what has to be addressed here.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + As you are, so once was I.  As I am, so you will be. +
  +                     Roman grave inscription          +
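Writing out the arithmetic behind that default explicitly (a tiny standalone sketch; the 40x raw slowdown and 90% cache rate are the assumptions quoted above, nothing more):

#include <stdio.h>

int main(void)
{
    double raw_slowdown = 40.0;   /* random I/O assumed 40x slower than sequential */
    double cache_rate   = 0.90;   /* 90% of random reads assumed to hit the cache  */

    /* Only the cache misses pay the raw penalty, giving the default of 4.0. */
    printf("effective random_page_cost = %.1f\n", raw_slowdown * (1.0 - cache_rate));
    return 0;
}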
On Wed, Dec 2, 2015 at 8:42 PM, Bruce Momjian <bruce@momjian.us> wrote:
> No one mentioned the random page docs so I will quote them here:
>
>	http://www.postgresql.org/docs/9.5/static/runtime-config-query.html#RUNTIME-CONFIG-QUERY-CONSTANTS
>
>	Random access to mechanical disk storage is normally much more expensive
>	than four times sequential access.  However, a lower default is used
>	(4.0) because the majority of random accesses to disk, such as indexed
>	reads, are assumed to be in cache.  The default value can be thought of
>	as modeling random access as 40 times slower than sequential, while
>	expecting 90% of random reads to be cached.
>
>	If you believe a 90% cache rate is an incorrect assumption for your
>	workload, you can increase random_page_cost to better reflect the true
>	cost of random storage reads.  Correspondingly, if your data is likely to
>	be completely in cache, such as when the database is smaller than the
>	total server memory, decreasing random_page_cost can be appropriate.
>	Storage that has a low random read cost relative to sequential, e.g.
>	solid-state drives, might also be better modeled with a lower value for
>	random_page_cost.
>
> What we don't have is a way to know how much is in the cache, not only at
> planning time, but at execution time.  (Those times are often
> different for prepared queries.)  I think that is the crux of what has
> to be addressed here.

I think that paragraph is more of an apology for the system that we've got than a description of what a good one would look like.  If I have a 1MB table and a 1TB table, they are not equally likely to be cached.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company