Re: Question about optimising (Postgres_)FDW - Mailing list pgsql-hackers

From: Hannu Krosing
Subject: Re: Question about optimising (Postgres_)FDW
Msg-id: 534EAB5A.2040708@krosing.net
In response to: Re: Question about optimising (Postgres_)FDW (Hannu Krosing <hannu@krosing.net>)
Responses: Re: Question about optimising (Postgres_)FDW
List: pgsql-hackers
On 04/16/2014 03:16 PM, Hannu Krosing wrote:
> On 04/16/2014 01:35 PM, Etsuro Fujita wrote:
>> (2014/04/16 6:55), Hannu Krosing wrote:
> ...
>> Maybe I'm missing something, but I think that you can do what I think
>> you'd like to do by the following procedure:
> No, what I'd like PostgreSQL to do is to
>
> 1. select the id set from the local table
> 2. select the rows from the remote table with WHERE ID IN (<set selected
>    in step 1>)
> 3. then join the original set to the selected set, with any suitable
>    join strategy
>
> The things I do not want are
>
> A. selecting all rows from the remote table
>    (this is what your examples below do)
>
> or
>
> B. selecting rows from the remote table by single selects using "ID = $"
>    (this is something that I managed to do by some tweaking of costs)
>
> as A will always be slow if there are millions of rows in the remote
> table, and B is slow(ish) when the id set is over a few hundred ids.
>
> I hope this is a better explanation than the one I provided before.
>
> Cheers
> Hannu
>
> P.S. I am not sure if this is a limitation of postgres_fdw or postgres
> itself.
>
> P.P.S. I tested a little with Multicorn, and PostgreSQL did not request
> row counts for any IN plans, so it may be that the planner does not
> consider this kind of plan at all. (Testing was on PgSQL 9.3.4.)
>
> Hannu

Also, here is a sample run of the two plans to illustrate my point.

How it is run now:

testdb=# explain analyse verbose
         select r.data, l.data
         from onemillion_pgfdw r
         join onemillion l on r.id = l.id and l.id between 100000 and 100100;
                                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=111.61..198.40 rows=1 width=16) (actual time=7534.360..8731.541 rows=101 loops=1)
   Output: r.data, l.data
   Hash Cond: (r.id = l.id)
   ->  Foreign Scan on public.onemillion_pgfdw r  (cost=100.00..178.25 rows=2275 width=12) (actual time=1.628..8364.688 rows=1000000 loops=1)
         Output: r.id, r.inserted, r.data
         Remote SQL: SELECT id, data FROM public.onemillion
   ->  Hash  (cost=10.39..10.39 rows=98 width=12) (actual time=0.179..0.179 rows=101 loops=1)
         Output: l.data, l.id
         Buckets: 1024  Batches: 1  Memory Usage: 5kB
         ->  Index Scan using onemillion_pkey on public.onemillion l  (cost=0.42..10.39 rows=98 width=12) (actual time=0.049..0.124 rows=101 loops=1)
               Output: l.data, l.id
               Index Cond: ((l.id >= 100000) AND (l.id <= 100100))
 Total runtime: 8732.213 ms
(13 rows)

Time: 8733.799 ms

And how the above query should be planned/executed:

testdb=# explain analyse verbose
         select r.data, l.data
         from (select *
               from onemillion_pgfdw
               where id = any (array(select id
                                     from onemillion
                                     where id between 100000 and 100100))) r
         join onemillion l on r.id = l.id;
                                                                     QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=110.81..1104.30 rows=111 width=16) (actual time=2.756..3.738 rows=101 loops=1)
   Output: onemillion_pgfdw.data, l.data
   InitPlan 1 (returns $0)
     ->  Index Only Scan using onemillion_pkey on public.onemillion  (cost=0.42..10.39 rows=98 width=4) (actual time=0.055..0.118 rows=101 loops=1)
           Output: onemillion.id
           Index Cond: ((onemillion.id >= 100000) AND (onemillion.id <= 100100))
           Heap Fetches: 101
   ->  Foreign Scan on public.onemillion_pgfdw  (cost=100.00..163.41 rows=111 width=12) (actual time=2.729..3.012 rows=101 loops=1)
         Output: onemillion_pgfdw.id, onemillion_pgfdw.inserted, onemillion_pgfdw.data
         Remote SQL: SELECT id, data FROM public.onemillion WHERE ((id = ANY ($1::integer[])))
   ->  Index Scan using onemillion_pkey on public.onemillion l  (cost=0.42..8.37 rows=1 width=12) (actual time=0.005..0.006 rows=1 loops=101)
         Output: l.id, l.inserted, l.data
         Index Cond: (l.id = onemillion_pgfdw.id)
 Total runtime: 4.469 ms
(14 rows)

Time: 6.437 ms
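For completeness, here is a minimal sketch of the kind of setup the plans
above assume (the server name, connection options, and data generator are
illustrative guesses; the original message does not show its DDL):

    -- Sketch of an assumed test setup; names and options are hypothetical.
    CREATE EXTENSION postgres_fdw;

    -- A loopback server, so the "remote" table lives in the same cluster.
    CREATE SERVER loopback FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'localhost', dbname 'testdb');
    CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;

    -- Local table with one million rows and a primary key on id.
    CREATE TABLE onemillion (
        id       serial PRIMARY KEY,
        inserted timestamptz DEFAULT now(),
        data     text
    );
    INSERT INTO onemillion (data)
        SELECT random()::text FROM generate_series(1, 1000000);

    -- Foreign table pointing back at the same data.
    CREATE FOREIGN TABLE onemillion_pgfdw (
        id       integer,
        inserted timestamptz,
        data     text
    ) SERVER loopback OPTIONS (schema_name 'public', table_name 'onemillion');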
>> postgres=# ALTER SERVER loop OPTIONS (ADD fdw_startup_cost '1000');
>> ALTER SERVER
>> postgres=# EXPLAIN VERBOSE SELECT * FROM onemillion_pgsql WHERE id in
>> (SELECT id FROM onemillion WHERE data > '0.9' LIMIT 100);
>>                                           QUERY PLAN
>> -----------------------------------------------------------------------------------------------
>>  Hash Semi Join  (cost=1023.10..41983.21 rows=100 width=30)
>>    Output: onemillion_pgsql.id, onemillion_pgsql.inserted, onemillion_pgsql.data
>>    Hash Cond: (onemillion_pgsql.id = onemillion.id)
>>    ->  Foreign Scan on public.onemillion_pgsql  (cost=1000.00..39334.00 rows=1000000 width=29)
>>          Output: onemillion_pgsql.id, onemillion_pgsql.inserted, onemillion_pgsql.data
>>          Remote SQL: SELECT id, inserted, data FROM public.onemillion
>>    ->  Hash  (cost=21.85..21.85 rows=100 width=4)
>>          Output: onemillion.id
>>          ->  Limit  (cost=0.00..20.85 rows=100 width=4)
>>                Output: onemillion.id
>>                ->  Seq Scan on public.onemillion  (cost=0.00..20834.00 rows=99918 width=4)
>>                      Output: onemillion.id
>>                      Filter: (onemillion.data > '0.9'::text)
>>  Planning time: 0.690 ms
>> (14 rows)
>>
>> or, as Tom mentioned, by disabling the use_remote_estimate option:
>>
>> postgres=# ALTER FOREIGN TABLE onemillion_pgsql OPTIONS (SET
>> use_remote_estimate 'false');
>> ALTER FOREIGN TABLE
>> postgres=# EXPLAIN VERBOSE SELECT * FROM onemillion_pgsql WHERE id in
>> (SELECT id FROM onemillion WHERE data > '0.9' LIMIT 100);
>>                                           QUERY PLAN
>> ----------------------------------------------------------------------------------------------
>>  Hash Semi Join  (cost=123.10..41083.21 rows=100 width=30)
>>    Output: onemillion_pgsql.id, onemillion_pgsql.inserted, onemillion_pgsql.data
>>    Hash Cond: (onemillion_pgsql.id = onemillion.id)
>>    ->  Foreign Scan on public.onemillion_pgsql  (cost=100.00..38434.00 rows=1000000 width=30)
>>          Output: onemillion_pgsql.id, onemillion_pgsql.inserted, onemillion_pgsql.data
>>          Remote SQL: SELECT id, inserted, data FROM public.onemillion
>>    ->  Hash  (cost=21.85..21.85 rows=100 width=4)
>>          Output: onemillion.id
>>          ->  Limit  (cost=0.00..20.85 rows=100 width=4)
>>                Output: onemillion.id
>>                ->  Seq Scan on public.onemillion  (cost=0.00..20834.00 rows=99918 width=4)
>>                      Output: onemillion.id
>>                      Filter: (onemillion.data > '0.9'::text)
>>  Planning time: 0.215 ms
>> (14 rows)
>>
>> Thanks,
>>
>> Best regards,
>> Etsuro Fujita
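As a side note, the "tweaking of costs" mentioned under option B above can
be approximated (a sketch, reusing the hypothetical server and table names
from the setup sketch earlier; the values are illustrative, not the ones
actually used) by making whole-table foreign scans look expensive enough
that the planner prefers a parameterized "id = $1" foreign scan per outer
row:

    -- Penalize each remotely fetched tuple so a full foreign scan of one
    -- million rows is costed far above per-row "id = $1" lookups.
    ALTER SERVER loopback OPTIONS (ADD fdw_tuple_cost '1.0');

    -- Ask the remote side for estimates, so the parameterized path on the
    -- remote primary key is costed realistically and wins.
    ALTER FOREIGN TABLE onemillion_pgfdw OPTIONS (ADD use_remote_estimate 'true');

This yields plan B (one remote round trip per id), which, as noted above,
only helps while the id set stays small.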