Thread: Increasing parallelism of queries while using file fdw and partitions
Wondering if anyone had any thoughts on how to tweak my setup to get it to read many files at once instead of one at a time when using file fdw and partitions. We have a bunch of data tied up in files (each file > 4M rows, 5,000+ files per year) that I would like to be able to query directly using FDW. The files are genomic VCF format and I find that vcf_fdw ( https://github.com/ergo70/vcf_fdw ) works really well to read the data. We only want to be able to search the data as quickly as possible, no updates / deletes / ... I gave an example below of the basic setup and the output of explain analyze. I get the same performance if I setup the table such that the thousands of files end up in one non-partitioned table or setup each file as it's own partition of the table. I have tried increasing ( / decreasing ) the worker threads and workers, but don't see any change in the number of files open at any given time. I tried reducing the cost of parallel queries to force them to run, but can't get them to kick in. Any ideas or anything I can try? Thanks! Pat PostgreSQL: PostgreSQL 10.6 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28), 64-bit Multicorn: 1.3.5 VCF_FDW ( https://github.com/ergo70/vcf_fdw ) : 1.0.0 CREATE DATABASE variants; CREATE EXTENSION multicorn; CREATE SERVER multicorn_vcf FOREIGN DATA WRAPPER multicorn OPTIONS (wrapper 'vcf_fdw.VCFForeignDataWrapper'); CREATE SCHEMA vcf; CREATE TABLE vcf.variants ( ..., species text, ... ) PARTITION BY LIST ( species ); CREATE FOREIGN TABLE vcf.human ( ... ) SERVER multicorn_vcf OPTIONS (basedir '/path', species 'human', suffix '.vcf.gz'); ALTER TABLE vcf.variants ATTACH PARTITION vcf.human FOR VALUES IN ( 'human' ); CREATE FOREIGN TABLE vcf.dog ( ... ) SERVER multicorn_vcf OPTIONS (basedir '/path', species 'dog', suffix '.vcf.gz'); ALTER TABLE vcf.variants ATTACH PARTITION vcf.dog FOR VALUES IN ( 'dog' ); CREATE FOREIGN TABLE vcf.cat ( ... ) SERVER multicorn_vcf OPTIONS (basedir '/path', species 'cat', suffix '.vcf.gz'); ALTER TABLE vcf.variants ATTACH PARTITION vcf.cat FOR VALUES IN ( 'cat' ); * My real data repeats this 1000+ more times EXPLAIN ( ANALYZE, BUFFERS ) SELECT * FROM vcf.variants WHERE chrom = '1' AND pos = 10120 LIMIT 1000; On my real data I get the following results: -------------------------- QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ Limit (cost=20.00..352020.00 rows=1000 width=347) (actual time=445.548..101709.307 rows=20 loops=1) -> Append (cost=20.00..3555200000.00 rows=10100000 width=347) (actual time=445.547..101709.285 rows=20 loops=1) -> Foreign Scan on dog (cost=20.00..3520000.00 rows=10000 width=352) (actual time=198.653..198.654 rows=0 loops=1) Filter: ((chrom = '1'::text) AND (pos = 10120)) -> Foreign Scan on cat (cost=20.00..3520000.00 rows=10000 width=352) (actual time=111.840..111.840 rows=0 loops=1) Filter: ((chrom = '1'::text) AND (pos = 10120)) -> Foreign Scan on human (cost=20.00..3520000.00 rows=10000 width=352) (actual time=135.050..138.534 rows=1 loops=1) Filter: ((chrom = '1'::text) AND (pos = 10120)) ... repeats many more times for each partition Planning time: 613.815 ms Execution time: 101873.880 ms (2024 rows)
On Tue, Dec 18, 2018 at 08:39:36PM -0800, Patrick Mulrooney wrote: > Wondering if anyone had any thoughts on how to tweak my setup to get it to > read many files at once instead of one at a time when using file fdw and > partitions. I found this: https://www.postgresql.org/docs/current/parallel-safety.html |The following operations are always parallel restricted. |Scans of foreign tables, unless the foreign data wrapper has an IsForeignScanParallelSafe API which indicates otherwise. https://github.com/ergo70/vcf_fdw/blob/master/vcf_fdw/__init__.py => has no such API marker, since it's couple years old, same as multicorn. Justin
Re: Increasing parallelism of queries while using file fdw and partitions
From
Patrick Mulrooney
Date:
Justin, Thanks for the idea. I pulled down the source for multicorn and added that to it. I do not see parallel queries in the analyzeoutput (unless I force it and then it only gets one worker), but it does look like it is reading more than one fileat once if I go with a non-partitioned table that looks at all the files. Not any better if I have the table split upinto partitions. So it’s better, but still curious if this would work with partitions. Thanks again. Pat > On Dec 18, 2018, at 22:51, Justin Pryzby <pryzby@telsasoft.com> wrote: > >> On Tue, Dec 18, 2018 at 08:39:36PM -0800, Patrick Mulrooney wrote: >> Wondering if anyone had any thoughts on how to tweak my setup to get it to >> read many files at once instead of one at a time when using file fdw and >> partitions. > > I found this: > > https://www.postgresql.org/docs/current/parallel-safety.html > |The following operations are always parallel restricted. > |Scans of foreign tables, unless the foreign data wrapper has an IsForeignScanParallelSafe API which indicates otherwise. > > https://github.com/ergo70/vcf_fdw/blob/master/vcf_fdw/__init__.py > => has no such API marker, since it's couple years old, same as multicorn. > > Justin