Thread: Cursor-based results: bafflingly slow

Cursor-based results: bafflingly slow

From

Robin Houston

Date:

03 July 2009, 10:01:25

Dear list,

I'm using setFetchSize(100) to stream results from the server, and I'm
really puzzled by how slow some queries are to execute. I don't think
I'm doing anything obviously stupid: I've setAutoCommit(false) and so
forth.

For example, here's part of the output from loglevel=2:

13:03:08.871 (2)  FE=> Parse(stmt=S_9,query="select polypeptide.feature_id
  , featureprop_type_cv.name as type_cv
  , featureprop_type.name as type
  , featureprop.value
from feature polypeptide
join featureprop on featureprop.feature_id = polypeptide.feature_id
join cvterm featureprop_type on featureprop.type_id = featureprop_type.cvterm_id
join cv featureprop_type_cv on featureprop_type.cv_id = featureprop_type.cv_id
where polypeptide.type_id in (
191
)
and polypeptide.organism_id = (
   select organism_id from organism
   where common_name = $1
)
order by polypeptide.feature_id",oids={1043})
13:03:08.877 (2)  FE=> Bind(stmt=S_9,portal=C_10,$1=<Etenella>)
13:03:08.878 (2)  FE=> Describe(portal=C_10)
13:03:08.878 (2)  FE=> Execute(portal=C_10,limit=100)
13:03:08.878 (2)  FE=> Sync


(The database schema is Chado – http://gmod.org/wiki/Schema)

It has been running for more than half an hour with no apparent
response yet. If I try to do something equivalent in psql, I can't
find any way to make it take more than a few seconds.

For example, if I use "limit 100" to simulate the effect of fetching
the first 100 results:

prepare query as
select polypeptide.feature_id
  , featureprop_type_cv.name as type_cv
  , featureprop_type.name as type
  , featureprop.value
from feature polypeptide
join featureprop on featureprop.feature_id = polypeptide.feature_id
join cvterm featureprop_type on featureprop.type_id = featureprop_type.cvterm_id
join cv featureprop_type_cv on featureprop_type.cv_id = featureprop_type.cv_id
where polypeptide.type_id in (
191
)
and polypeptide.organism_id = (
   select organism_id from organism
   where common_name = $1
)
order by polypeptide.feature_id
limit 100
;

explain analyze execute query('Etenella');

then I get:

 Limit  (cost=1.95..42662.18 rows=100 width=612) (actual
time=219.367..253.029 rows=100 loops=1)
  InitPlan
    ->  Seq Scan on organism  (cost=0.00..1.95 rows=1 width=4)
(actual time=8.175..8.196 rows=1 loops=1)
          Filter: ((common_name)::text = $1)
  ->  Nested Loop  (cost=0.00..2245634.37 rows=5264 width=612)
(actual time=219.358..252.312 rows=100 loops=1)
        ->  Nested Loop  (cost=0.00..2245341.09 rows=188 width=96)
(actual time=186.960..218.784 rows=4 loops=1)
              ->  Nested Loop  (cost=0.00..1948637.05 rows=37696
width=56) (actual time=166.944..175.615 rows=4 loops=1)
                    ->  Index Scan using feature_genedb_idx1 on
feature polypeptide  (cost=0.00..108669.87 rows=32192 width=4) (actual
time=66.732..66.732 rows=1 loops=1)
                          Index Cond: ((organism_id = $0) AND (type_id = 191))
                    ->  Index Scan using featureprop_idx1 on
featureprop  (cost=0.00..55.64 rows=121 width=56) (actual
time=100.187..108.833 rows=4 loops=1)
                          Index Cond: (featureprop.feature_id =
polypeptide.feature_id)
              ->  Index Scan using cvterm_pkey on cvterm
featureprop_type  (cost=0.00..7.86 rows=1 width=52) (actual
time=10.766..10.771 rows=1 loops=4)
                    Index Cond: (featureprop_type.cvterm_id =
featureprop.type_id)
                    Filter: (featureprop_type.cv_id = featureprop_type.cv_id)
        ->  Seq Scan on cv featureprop_type_cv  (cost=0.00..1.28
rows=28 width=516) (actual time=8.101..8.193 rows=25 loops=4)
 Total runtime: 270.637 ms


Any suggestions?

Thanks,
Robin

Re: Cursor-based results: bafflingly slow

From

Oliver Jowett

Date:

03 July 2009, 10:17:15

Robin Houston wrote:

> For example, if I use "limit 100" to simulate the effect of fetching
> the first 100 results:

They're not really equivalent. A portal with a limited fetchsize is more
like DECLARE CURSOR, not LIMIT.

What does the plan look like if you plan it without the LIMIT?

-O

Cursor-based results: bafflingly slow

From

Robin Houston

Date:

03 July 2009, 10:40:55

2009/7/3 Oliver Jowett <oliver@opencloud.com>:

> They're not really equivalent. A portal with a limited fetchsize is more
> like DECLARE CURSOR, not LIMIT.
>
> What does the plan look like if you plan it without the LIMIT?

Hmm, well... If I do the same as before, but without the limit, then the plan is:

Sort (cost=353822.56..353835.72 rows=5264 width=612)
Sort Key: polypeptide.feature_id

InitPlan
-> Seq Scan on organism (cost=0.00..1.95 rows=1 width=4)

Filter: ((common_name)::text = $1)

-> Nested Loop (cost=88664.31..352053.24 rows=5264 width=612)
-> Hash Join (cost=88663.00..351946.66 rows=188 width=96)
Hash Cond: (featureprop.feature_id = polypeptide.feature_id)
-> Hash Join (cost=2790.36..264059.63 rows=51268 width=96)
Hash Cond: (featureprop.type_id = featureprop_type.cvterm_id)
-> Seq Scan on featureprop (cost=0.00..222284.97 rows=10259097 width=56)
-> Hash (cost=2785.51..2785.51 rows=388 width=52)
-> Seq Scan on cvterm featureprop_type (cost=0.00..2785.51 rows=388 width=52)
Filter: (cv_id = cv_id)
-> Hash (cost=85344.24..85344.24 rows=32192 width=4)
-> Bitmap Heap Scan on feature polypeptide (cost=827.57..85344.24 rows=32192 width=4)
Recheck Cond: ((organism_id = $0) AND (type_id = 191))
-> Bitmap Index Scan on feature_genedb_idx1 (cost=0.00..819.52 rows=32192 width=0)

Index Cond: ((organism_id = $0) AND (type_id = 191))

-> Materialize (cost=1.31..1.59 rows=28 width=516)

-> Seq Scan on cv featureprop_type_cv (cost=0.00..1.28 rows=28 width=516)

Surely the server is not so stupid as to use an unnecessary server-side sort, if I've explicitly asked to fetch the rows 100 at a time?

Is there any easy way to find out what execution plan the server is actually using? I suppose I could ask the DBAs to enable auto_explain on the dev database.

If this *is* the problem, do you know any way round it. Obviously I want the plan that *doesn't* involve sorting everything before returning anything.

If I explicitly create a cursor, with no limit, then it does return the first 100 rows very quickly. On the other hand, I have to hard-code the parameter this way (because I don't know any way, in psql, to use a bind variable with an explicit cursor.)

begin;

declare polypeptide_props_c no scroll cursor for [...query...]

fetch forward 100 from polypeptide_props_c;

rollback;

This fetch completes in 31.313 ms.

Thanks for your help so far!

Robin

Re: Cursor-based results: bafflingly slow

From

Oliver Jowett

Date:

03 July 2009, 11:13:08

Robin Houston wrote:
> 2009/7/3 Oliver Jowett <oliver@opencloud.com <mailto:oliver@opencloud.com>>:
>> They're not really equivalent. A portal with a limited fetchsize is more
>> like DECLARE CURSOR, not LIMIT.
>>
>> What does the plan look like if you plan it without the LIMIT?
>
> Hmm, well... If I do the same as before, but without the limit, then the
> plan is: [.. different plan with an outermost Sort step ..]

> Surely the server is not so stupid as to use an unnecessary server-side
> sort, if I've explicitly asked to fetch the rows 100 at a time?

I'm not too familiar with the planner any more, but as I understand it
this is what's happening:

Think about the difference between

(1) SELECT * FROM hugetable ORDER BY some_key;
(2) SELECT * FROM hugetable ORDER BY some_key LIMIT 100;

where there is a suitable index on some_key.

For (1) a seqscan + local sort is likely to be faster (because a
sequential scan is faster than doing a random-access index scan over the
entire table)
For (2) an index scan is likely to be faster, because you'll stop after
at most 100 random-access lookups, which is cheaper than scanning the
entire table.

If you take case (1) and set the fetchsize to 100, that doesn't actually
change things at all - it doesn't turn into case (2)! Portals let you
grab results a bit at a time (and let you give up on the query halfway
through easily), but they don't change the underlying plan at all; the
server plans on the assumption that the query will run to completion. In
fact, it doesn't even know the fetchsize that will eventually be used
during execution when it is doing the planning step. (If you look back
at your connection tracing, you'll see that the fetchsize is only
provided in the Execute message; query planning happens back when the
Parse message is sent)

> Is there any easy way to find out what execution plan the server is
> actually using? I suppose I could ask the DBAs to enable auto_explain on
> the dev database.

You should be able to issue an EXPLAIN ANALYZE via JDBC; I can't
remember exactly how the results get back to you though (might be
SQLWarnings, or a separate resultset)

> If this *is* the problem, do you know any way round it. Obviously I want
> the plan that *doesn't* involve sorting everything before returning
> anything.

If you want to stream results promptly, you may be stuck with removing
that ORDER BY. I'm not sure if there's any way to say to the planner
"give me a plan with a low startup cost, even if it has a higher total
cost" which seems to be what you want here.

Or if you really do only care about the first N rows, then put a LIMIT
in your query.

> If I explicitly create a cursor, with no limit, then it does return the
> first 100 rows very quickly. On the other hand, I have to hard-code the
> parameter this way (because I don't know any way, in psql, to use a bind
> variable with an explicit cursor.)

I came across some (very old) posts in the archives that suggested that
DECLARE CURSOR has a fudge where it ranks plan costs assuming that only
some rows (10%?) will actually be fetched. I'm not sure if that's true
of the current planner.

-O

Re: Cursor-based results: bafflingly slow

From

Tom Lane

Date:

03 July 2009, 11:24:50

Oliver Jowett <oliver@opencloud.com> writes:
> If you want to stream results promptly, you may be stuck with removing
> that ORDER BY. I'm not sure if there's any way to say to the planner
> "give me a plan with a low startup cost, even if it has a higher total
> cost" which seems to be what you want here.

You need to use a cursor for that.  The problem with the fetchsize
mechanism is that it doesn't give the planner any clue that the user
might not intend to fetch the whole result, or might prefer low startup
cost to low total cost anyway.

> I came across some (very old) posts in the archives that suggested that
> DECLARE CURSOR has a fudge where it ranks plan costs assuming that only
> some rows (10%?) will actually be fetched. I'm not sure if that's true
> of the current planner.

Exactly.  As of 8.4 you can control the percentage, too.

            regards, tom lane

Re: Cursor-based results: bafflingly slow

From

Robin Houston

Date:

03 July 2009, 11:38:20

2009/7/3 Tom Lane <tgl@sss.pgh.pa.us>

You need to use a cursor for that. The problem with the fetchsize
mechanism is that it doesn't give the planner any clue that the user
might not intend to fetch the whole result, or might prefer low startup
cost to low total cost anyway.

Thank you very much (and Oliver too). That's what I needed to know. I'll change my code to use explicit cursors.

Robin