Thread: Increasing parallelism of queries while using file fdw and partitions

Increasing parallelism of queries while using file fdw and partitions

From

Patrick Mulrooney

Date:

19 December 2018, 04:39:36

Wondering if anyone had any thoughts on how to tweak my setup to get it 
to read many files at once instead of one at a time when using file fdw 
and partitions. We have a bunch of data tied up in files (each file > 4M 
rows, 5,000+ files per year) that I would like to be able to query 
directly using FDW. The files are genomic VCF format and I find that 
vcf_fdw ( https://github.com/ergo70/vcf_fdw ) works really well to read 
the data. We only want to be able to search the data as quickly as 
possible, no updates / deletes / ...

I gave an example below of the basic setup and the output of explain 
analyze. I get the same performance if I setup the table such that the 
thousands of files end up in one non-partitioned table or setup each 
file as it's own partition of the table.

I have tried increasing ( / decreasing ) the worker threads and workers, 
but don't see any change in the number of files open at any given time. 
I tried reducing the cost of parallel queries to force them to run, but 
can't get them to kick in.

Any ideas or anything I can try?

Thanks!

Pat

PostgreSQL:  PostgreSQL 10.6 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28), 64-bit
Multicorn: 1.3.5
VCF_FDW ( https://github.com/ergo70/vcf_fdw ) : 1.0.0


CREATE DATABASE variants;

CREATE EXTENSION multicorn;

CREATE SERVER multicorn_vcf FOREIGN DATA WRAPPER multicorn OPTIONS (wrapper 'vcf_fdw.VCFForeignDataWrapper');

CREATE SCHEMA vcf;

CREATE TABLE vcf.variants ( ..., species text, ... ) PARTITION BY LIST ( species );

CREATE FOREIGN TABLE vcf.human ( ... ) SERVER multicorn_vcf OPTIONS (basedir '/path', species 'human', suffix
'.vcf.gz');
ALTER TABLE vcf.variants ATTACH PARTITION vcf.human FOR VALUES IN ( 'human' );

CREATE FOREIGN TABLE vcf.dog ( ... ) SERVER multicorn_vcf OPTIONS (basedir '/path', species 'dog', suffix '.vcf.gz');
ALTER TABLE vcf.variants ATTACH PARTITION vcf.dog FOR VALUES IN ( 'dog' );

CREATE FOREIGN TABLE vcf.cat ( ... ) SERVER multicorn_vcf OPTIONS (basedir '/path', species 'cat', suffix '.vcf.gz');
ALTER TABLE vcf.variants ATTACH PARTITION vcf.cat FOR VALUES IN ( 'cat' );

* My real data repeats this 1000+ more times

EXPLAIN ( ANALYZE, BUFFERS ) SELECT * FROM vcf.variants WHERE chrom = '1' AND pos = 10120 LIMIT 1000;

On my real data I get the following results:
--------------------------
                                                                                  QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Limit  (cost=20.00..352020.00 rows=1000 width=347) (actual time=445.548..101709.307 rows=20 loops=1)
    ->  Append  (cost=20.00..3555200000.00 rows=10100000 width=347) (actual time=445.547..101709.285 rows=20 loops=1)
          ->  Foreign Scan on dog  (cost=20.00..3520000.00 rows=10000 width=352) (actual time=198.653..198.654 rows=0
loops=1)
                Filter: ((chrom = '1'::text) AND (pos = 10120))
          ->  Foreign Scan on cat  (cost=20.00..3520000.00 rows=10000 width=352) (actual time=111.840..111.840 rows=0
loops=1)
                Filter: ((chrom = '1'::text) AND (pos = 10120))
          ->  Foreign Scan on human  (cost=20.00..3520000.00 rows=10000 width=352) (actual time=135.050..138.534 rows=1
loops=1)
                Filter: ((chrom = '1'::text) AND (pos = 10120))
          ... repeats many more times for each partition
  Planning time: 613.815 ms
  Execution time: 101873.880 ms
(2024 rows)

Re: Increasing parallelism of queries while using file fdw andpartitions

From

Justin Pryzby

Date:

19 December 2018, 06:51:49

On Tue, Dec 18, 2018 at 08:39:36PM -0800, Patrick Mulrooney wrote:
> Wondering if anyone had any thoughts on how to tweak my setup to get it to
> read many files at once instead of one at a time when using file fdw and
> partitions.

I found this:

https://www.postgresql.org/docs/current/parallel-safety.html
|The following operations are always parallel restricted.
|Scans of foreign tables, unless the foreign data wrapper has an IsForeignScanParallelSafe API which indicates
otherwise.

https://github.com/ergo70/vcf_fdw/blob/master/vcf_fdw/__init__.py
=> has no such API marker, since it's couple years old, same as multicorn.

Justin

Re: Increasing parallelism of queries while using file fdw and partitions

From

Patrick Mulrooney

Date:

19 December 2018, 07:36:45

Justin,

Thanks for the idea. I pulled down the source for multicorn and added that to it. I do not see parallel queries in the
analyzeoutput (unless I force it and then it only gets one worker), but it does look like it is reading more than one
fileat once if I go with a non-partitioned table that looks at all the files. Not any better if I have the table split
upinto partitions.  

So it’s better, but still curious if this would work with partitions.

Thanks again.

Pat

> On Dec 18, 2018, at 22:51, Justin Pryzby <pryzby@telsasoft.com> wrote:
>
>> On Tue, Dec 18, 2018 at 08:39:36PM -0800, Patrick Mulrooney wrote:
>> Wondering if anyone had any thoughts on how to tweak my setup to get it to
>> read many files at once instead of one at a time when using file fdw and
>> partitions.
>
> I found this:
>
> https://www.postgresql.org/docs/current/parallel-safety.html
> |The following operations are always parallel restricted.
> |Scans of foreign tables, unless the foreign data wrapper has an IsForeignScanParallelSafe API which indicates
otherwise.
>
> https://github.com/ergo70/vcf_fdw/blob/master/vcf_fdw/__init__.py
> => has no such API marker, since it's couple years old, same as multicorn.
>
> Justin