Re: CustomScan under the Gather node? - Mailing list pgsql-hackers
From | Kouhei Kaigai |
---|---|
Subject | Re: CustomScan under the Gather node? |
Date | |
Msg-id | 9A28C8860F777E439AA12E8AEA7694F8011A3071@BPXM15GP.gisp.nec.co.jp |
In response to | CustomScan under the Gather node? (Kouhei Kaigai <kaigai@ak.jp.nec.com>) |
Responses | Re: CustomScan under the Gather node? |
List | pgsql-hackers |
> If I would make a proof-of-concept patch with interface itself, it
> seems to me file_fdw may be a good candidate for this enhancement.
> It is not a field for postgres_fdw.
>
The attached patch is an enhancement of the FDW/CSP interface and a PoC
feature of file_fdw to scan the source file partially.
It was a smaller enhancement than I expected.

It works as follows. This query tried to read 20M rows from a CSV file,
using 3 background worker processes.

postgres=# set max_parallel_degree = 3;
SET
postgres=# explain analyze select * from test_csv where id % 20 = 6;
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Gather  (cost=1000.00..194108.60 rows=94056 width=52) (actual time=0.570..19268.010 rows=2000000 loops=1)
   Number of Workers: 3
   ->  Parallel Foreign Scan on test_csv  (cost=0.00..183703.00 rows=94056 width=52) (actual time=0.180..12744.655 rows=500000 loops=4)
         Filter: ((id % 20) = 6)
         Rows Removed by Filter: 9500000
         Foreign File: /tmp/testdata.csv
         Foreign File Size: 1504892535
 Planning time: 0.147 ms
 Execution time: 19330.201 ms
(9 rows)

I'm not 100% certain whether this implementation of file_fdw is reasonable
for partial reads; however, the callbacks located in the following functions
made it possible to implement parallel-aware custom logic based on the
coordination information.

> * ExecParallelEstimate
> * ExecParallelInitializeDSM
> * ExecParallelInitializeWorker

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

> -----Original Message-----
> From: Kaigai Kouhei(海外 浩平)
> Sent: Thursday, January 28, 2016 9:33 AM
> To: 'Robert Haas'
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] CustomScan under the Gather node?
>
> > On Tue, Jan 26, 2016 at 1:30 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> > > What enhancement will be necessary to implement similar feature of
> > > partial seq-scan using custom-scan interface?
> > >
> > > It seems to me callbacks on the three points below are needed.
> > > * ExecParallelEstimate
> > > * ExecParallelInitializeDSM
> > > * ExecParallelInitializeWorker
> > >
> > > Anything else?
> > > Does ForeignScan also need equivalent enhancement?
> >
> > For postgres_fdw, running the query from a parallel worker would
> > change the transaction semantics.  Suppose you begin a transaction,
> > UPDATE data on the foreign server, and then run a parallel query.  If
> > the leader performs the ForeignScan it will see the uncommitted
> > UPDATE, but a worker would have to make its own connection which would
> > not be part of the same transaction and which would therefore not see
> > the update.  That's a problem.
> >
> Ah, yes. As long as the FDW driver ensures the remote session has no
> uncommitted data, pg_export_snapshot() might provide us an opportunity;
> however, once a session writes something, the FDW driver has to prohibit it.
>
> > Also, for postgres_fdw, and many other FDWs I suspect, the assumption
> > is that most of the work is being done on the remote side, so doing
> > the work in a parallel worker doesn't seem super interesting.  Instead
> > of incurring transfer costs to move the data from remote to local, we
> > incur two sets of transfer costs: first remote to local, then worker
> > to leader.  Ouch.  I think a more promising line of inquiry is to try
> > to provide asynchronous execution when we have something like:
> >
> > Append
> > -> Foreign Scan
> > -> Foreign Scan
> >
> > ...so that we can return a row from whichever Foreign Scan receives
> > data back from the remote server first.
> >
> > So it's not impossible that an FDW author could want this, but mostly
> > probably not.  I think.
> >
> Yes, I have the same opinion. Local parallelism is likely not
> valuable for the class of FDWs that obtain data from a remote
> server (e.g., postgres_fdw, ...), except for the case when packing
> and unpacking cost over the network is the major bottleneck.
>
> On the other hand, it will be valuable for the class of FDWs that
> act as a wrapper around a local data structure, as the current
> partial seq-scan does (e.g., file_fdw, ...).
> Their data source is not under transaction control, and the 'remote
> execution' of these FDWs is ultimately performed on local
> computing resources.
>
> If I would make a proof-of-concept patch with interface itself, it
> seems to me file_fdw may be a good candidate for this enhancement.
> It is not a field for postgres_fdw.
>
> Thanks,
> --
> NEC Business Creation Division / PG-Strom Project
> KaiGai Kohei <kaigai@ak.jp.nec.com>
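
The three callback points named in this message (ExecParallelEstimate, ExecParallelInitializeDSM, ExecParallelInitializeWorker) suggest a pattern of "size the shared state, initialize it once, then let every participant use it". The following is a minimal sketch of that pattern for a partial file scan, assuming a simple chunk-based split of the file. It is not the attached patch: `FileScanShared` and the `file_scan_*` functions are hypothetical names, and plain C11 atomics stand in for whatever shared-memory facilities the real executor would use.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical coordination state. In a real parallel scan this struct
 * would live in shared memory visible to the leader and all workers.
 */
typedef struct FileScanShared
{
    uint64_t         file_size;    /* total bytes in the source file */
    uint64_t         chunk_size;   /* bytes handed out per claim */
    _Atomic uint64_t next_offset;  /* first byte not yet claimed */
} FileScanShared;

/* Step 1 (estimate): report how much shared space the scan needs. */
static size_t
file_scan_estimate(void)
{
    return sizeof(FileScanShared);
}

/* Step 2 (initialize shared state): done once, by the leader. */
static void
file_scan_init_shared(FileScanShared *shared,
                      uint64_t file_size, uint64_t chunk_size)
{
    assert(chunk_size > 0);
    shared->file_size = file_size;
    shared->chunk_size = chunk_size;
    atomic_init(&shared->next_offset, 0);
}

/*
 * Step 3 (per participant): claim the next unread byte range.
 * Returns false once the whole file has been handed out. The atomic
 * fetch-and-add guarantees that no two participants get the same range.
 */
static bool
file_scan_claim_chunk(FileScanShared *shared,
                      uint64_t *start, uint64_t *end)
{
    uint64_t off = atomic_fetch_add(&shared->next_offset,
                                    shared->chunk_size);

    if (off >= shared->file_size)
        return false;

    *start = off;
    *end = (off + shared->chunk_size < shared->file_size)
        ? off + shared->chunk_size
        : shared->file_size;
    return true;
}
```

Each participant would loop on `file_scan_claim_chunk()` until it returns false. Byte ranges rarely end on row boundaries, so a reader built on this sketch would also need a rule for rows that straddle two chunks (for example, skipping to the first line start inside its range and reading past the end of the range to finish its last row); quoted newlines in CSV data make that harder, and the sketch does not address it.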