Re: Question about optimising (Postgres_)FDW - Mailing list pgsql-hackers

From: Hannu Krosing
Subject: Re: Question about optimising (Postgres_)FDW
Msg-id: 534EAB5A.2040708@krosing.net
In response to: Re: Question about optimising (Postgres_)FDW (Hannu Krosing <hannu@krosing.net>)
Responses: Re: Question about optimising (Postgres_)FDW
List: pgsql-hackers
On 04/16/2014 03:16 PM, Hannu Krosing wrote:
> On 04/16/2014 01:35 PM, Etsuro Fujita wrote:
>> (2014/04/16 6:55), Hannu Krosing wrote:
> ...
>> Maybe I'm missing something, but I think that you can do what I think
>> you'd like to do by the following procedure:
> No, what I'd like PostgreSQL to do is to
>
> 1. select the id set from the local table
> 2. select the rows from the remote table with WHERE ID IN (<set selected
>    in step 1>)
> 3. then join the original set to the selected set, with any suitable
>    join strategy
>
> The things I do not want are
>
> A. selecting all rows from the remote table
>    (this is what your examples below do)
>
> or
>
> B. selecting rows from the remote table by single selects using "ID = $"
>    (this is something that I managed to do by some tweaking of costs)
>
> as A will always be slow if there are millions of rows in the remote
> table, and B is slow(ish) when the id set is over a few hundred ids.
>
> I hope this is a better explanation than the one I provided before.
>
> Cheers
> Hannu
>
> P.S. I am not sure if this is a limitation of postgres_fdw or postgres
> itself.
>
> P.P.S. I tested a little with Multicorn, and PostgreSQL did not request
> row counts for any IN plans, so it may be that the planner does not
> consider this kind of plan at all. (Testing was on PgSQL 9.3.4.)
>
> Hannu

Also, here is a sample run of the two plans to illustrate my point.

How it is run now:

testdb=# explain analyse verbose
         select r.data, l.data
         from onemillion_pgfdw r
         join onemillion l on r.id = l.id and l.id between 100000 and 100100;
                                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=111.61..198.40 rows=1 width=16) (actual time=7534.360..8731.541 rows=101 loops=1)
   Output: r.data, l.data
   Hash Cond: (r.id = l.id)
   ->  Foreign Scan on public.onemillion_pgfdw r  (cost=100.00..178.25 rows=2275 width=12) (actual time=1.628..8364.688 rows=1000000 loops=1)
         Output: r.id, r.inserted, r.data
         Remote SQL: SELECT id, data FROM public.onemillion
   ->  Hash  (cost=10.39..10.39 rows=98 width=12) (actual time=0.179..0.179 rows=101 loops=1)
         Output: l.data, l.id
         Buckets: 1024  Batches: 1  Memory Usage: 5kB
         ->  Index Scan using onemillion_pkey on public.onemillion l  (cost=0.42..10.39 rows=98 width=12) (actual time=0.049..0.124 rows=101 loops=1)
               Output: l.data, l.id
               Index Cond: ((l.id >= 100000) AND (l.id <= 100100))
 Total runtime: 8732.213 ms
(13 rows)

Time: 8733.799 ms

And how the above query should be planned/executed:

testdb=# explain analyse verbose
         select r.data, l.data
         from (select *
               from onemillion_pgfdw
               where id = any (array(select id
                                     from onemillion
                                     where id between 100000 and 100100))) r
         join onemillion l on r.id = l.id;
                                                                     QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=110.81..1104.30 rows=111 width=16) (actual time=2.756..3.738 rows=101 loops=1)
   Output: onemillion_pgfdw.data, l.data
   InitPlan 1 (returns $0)
     ->  Index Only Scan using onemillion_pkey on public.onemillion  (cost=0.42..10.39 rows=98 width=4) (actual time=0.055..0.118 rows=101 loops=1)
           Output: onemillion.id
           Index Cond: ((onemillion.id >= 100000) AND (onemillion.id <= 100100))
           Heap Fetches: 101
   ->  Foreign Scan on public.onemillion_pgfdw  (cost=100.00..163.41 rows=111 width=12) (actual time=2.729..3.012 rows=101 loops=1)
         Output: onemillion_pgfdw.id, onemillion_pgfdw.inserted, onemillion_pgfdw.data
         Remote SQL: SELECT id, data FROM public.onemillion WHERE ((id = ANY ($1::integer[])))
   ->  Index Scan using onemillion_pkey on public.onemillion l  (cost=0.42..8.37 rows=1 width=12) (actual time=0.005..0.006 rows=1 loops=101)
         Output: l.id, l.inserted, l.data
         Index Cond: (l.id = onemillion_pgfdw.id)
 Total runtime: 4.469 ms
(14 rows)

Time: 6.437 ms
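For completeness, here is a minimal sketch of the kind of setup the plans
above assume (the server name, connection options, and data generator are
illustrative guesses; the original message does not show its DDL):

    -- Sketch of an assumed test setup; names and options are hypothetical.
    CREATE EXTENSION postgres_fdw;

    -- A loopback server, so the "remote" table lives in the same cluster.
    CREATE SERVER loopback FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'localhost', dbname 'testdb');
    CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;

    -- Local table with one million rows and a primary key on id.
    CREATE TABLE onemillion (
        id       serial PRIMARY KEY,
        inserted timestamptz DEFAULT now(),
        data     text
    );
    INSERT INTO onemillion (data)
        SELECT random()::text FROM generate_series(1, 1000000);

    -- Foreign table pointing back at the same data.
    CREATE FOREIGN TABLE onemillion_pgfdw (
        id       integer,
        inserted timestamptz,
        data     text
    ) SERVER loopback OPTIONS (schema_name 'public', table_name 'onemillion');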
>> postgres=# ALTER SERVER loop OPTIONS (ADD fdw_startup_cost '1000');
>> ALTER SERVER
>> postgres=# EXPLAIN VERBOSE SELECT * FROM onemillion_pgsql WHERE id in
>> (SELECT id FROM onemillion WHERE data > '0.9' LIMIT 100);
>>                                           QUERY PLAN
>> -----------------------------------------------------------------------------------------------
>>  Hash Semi Join  (cost=1023.10..41983.21 rows=100 width=30)
>>    Output: onemillion_pgsql.id, onemillion_pgsql.inserted, onemillion_pgsql.data
>>    Hash Cond: (onemillion_pgsql.id = onemillion.id)
>>    ->  Foreign Scan on public.onemillion_pgsql  (cost=1000.00..39334.00 rows=1000000 width=29)
>>          Output: onemillion_pgsql.id, onemillion_pgsql.inserted, onemillion_pgsql.data
>>          Remote SQL: SELECT id, inserted, data FROM public.onemillion
>>    ->  Hash  (cost=21.85..21.85 rows=100 width=4)
>>          Output: onemillion.id
>>          ->  Limit  (cost=0.00..20.85 rows=100 width=4)
>>                Output: onemillion.id
>>                ->  Seq Scan on public.onemillion  (cost=0.00..20834.00 rows=99918 width=4)
>>                      Output: onemillion.id
>>                      Filter: (onemillion.data > '0.9'::text)
>>  Planning time: 0.690 ms
>> (14 rows)
>>
>> or, as Tom mentioned, by disabling the use_remote_estimate option:
>>
>> postgres=# ALTER FOREIGN TABLE onemillion_pgsql OPTIONS (SET
>> use_remote_estimate 'false');
>> ALTER FOREIGN TABLE
>> postgres=# EXPLAIN VERBOSE SELECT * FROM onemillion_pgsql WHERE id in
>> (SELECT id FROM onemillion WHERE data > '0.9' LIMIT 100);
>>                                           QUERY PLAN
>> ----------------------------------------------------------------------------------------------
>>  Hash Semi Join  (cost=123.10..41083.21 rows=100 width=30)
>>    Output: onemillion_pgsql.id, onemillion_pgsql.inserted, onemillion_pgsql.data
>>    Hash Cond: (onemillion_pgsql.id = onemillion.id)
>>    ->  Foreign Scan on public.onemillion_pgsql  (cost=100.00..38434.00 rows=1000000 width=30)
>>          Output: onemillion_pgsql.id, onemillion_pgsql.inserted, onemillion_pgsql.data
>>          Remote SQL: SELECT id, inserted, data FROM public.onemillion
>>    ->  Hash  (cost=21.85..21.85 rows=100 width=4)
>>          Output: onemillion.id
>>          ->  Limit  (cost=0.00..20.85 rows=100 width=4)
>>                Output: onemillion.id
>>                ->  Seq Scan on public.onemillion  (cost=0.00..20834.00 rows=99918 width=4)
>>                      Output: onemillion.id
>>                      Filter: (onemillion.data > '0.9'::text)
>>  Planning time: 0.215 ms
>> (14 rows)
>>
>> Thanks,
>>
>> Best regards,
>> Etsuro Fujita
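As a side note, the "tweaking of costs" mentioned under option B above can
be approximated (a sketch, reusing the hypothetical server and table names
from the setup sketch earlier; the values are illustrative, not the ones
actually used) by making whole-table foreign scans look expensive enough
that the planner prefers a parameterized "id = $1" foreign scan per outer
row:

    -- Penalize each remotely fetched tuple so a full foreign scan of one
    -- million rows is costed far above per-row "id = $1" lookups.
    ALTER SERVER loopback OPTIONS (ADD fdw_tuple_cost '1.0');

    -- Ask the remote side for estimates, so the parameterized path on the
    -- remote primary key is costed realistically and wins.
    ALTER FOREIGN TABLE onemillion_pgfdw OPTIONS (ADD use_remote_estimate 'true');

This yields plan B (one remote round trip per id), which, as noted above,
only helps while the id set stays small.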