Thread: open/lseek overhead with many partitions (including with "faster partitioning pruning")
open/lseek overhead with many partitions (including with "faster partitioning pruning")
From
Justin Pryzby
Date:
TLDR: even with "faster pruning" patch, pg11dev open()+lseek() every file
backing every table being queried, even for those partitions eventually
"excluded".

One of our customers (the one with the largest number of child tables, currently
of relkind='r') takes a long time to plan queries, and I wondered whether we'd
benefit from the ongoing work, whether I should plan to convert to relkind='p',
and whether I could start testing that for our use cases.

I found by stracing that the backend is open()ing + lseek(SEEK_END)ing each file
in each relation being UNIONed in this view, including relations which were
excluded by constraints.  Their tables are partitioned by day (some are still
40+GB), so there are only 10-20 tables being included in the above query, total
size perhaps ~50GB and under 50 backing files, but it's accessing a LOT more
files than needed:

cold backend:
[pryzbyj@database ~]$ grep open /tmp/strace-pg12-0 |cut -d/ -f1,2 |sort |uniq -c |sort -nr |head
  26593 open("pg_tblspc/16400
   6894 open("pg_tblspc/16401
    482 open("base/16402

warm backend:
[pryzbyj@database ~]$ grep open /tmp/strace-pg12-1 |cut -d/ -f1,2 |sort |uniq -c |sort -nr |head
   6545 open("pg_tblspc/16401
   1338 open("pg_tblspc/16400
    254 open("base/16402

I was curious if that was improved in pg11dev, and if it applied to relkind='r'
or only to relkind='p'.  I applied the v47 patches for "path toward faster
partition pruning", and was able to quickly create a synthetic "huge table"
using truncate (and verified that the smgr logic actually opens all 1e5 files).

$ psql -d postgres -h /tmp --port 5678 -c "CREATE TABLE t(i int) PARTITION BY range(i)"
$ for a in `seq 1 999`; do psql -d postgres -h /tmp --port 5678 -c "CREATE TABLE t$a PARTITION OF t FOR VALUES FROM ($a) TO ($((1+a)))"; done
$ time for a in ./test11dev/base/13236/1???[0-9]; do echo "$a..."; truncate -s 1G $a; done  # make the existing, empty files 1GB
$ time for a in ./test11dev/base/13236/1???[0-9]; do echo "$a..."; for b in `seq 1 99`; do truncate -s 1G $a.$b; done; done  # make new 1GB segment files with .n extension

postgres=# explain SELECT COUNT(1) FROM t WHERE i=-99 LIMIT 1;
 Limit  (cost=0.00..0.01 rows=1 width=8)
   ->  Aggregate  (cost=0.00..0.01 rows=1 width=8)
         ->  Result  (cost=0.00..0.00 rows=0 width=0)
               One-Time Filter: false
Time: 63268.474 ms (01:03.268)

=> zfs/fuse is certainly exacerbating the issue here due to context switches; but..
[pryzbyj@database postgresql]$ time cut -d\( -f1 /tmp/strace-pg11dev-faster-15 |sort |uniq -c |sort -nr |head
 100899 open
  99900 lseek
  98918 close
    118 brk

postgres is still doing open()+lseek() on 100k files, most of which are for
relations which were clearly excluded:

(gdb) bt
#0  0x00000032b2adb580 in __open_nocancel () from /lib64/libc.so.6
#1  0x0000000000711827 in BasicOpenFilePerm (fileName=0x1e2c9d0 "base/13236/16759.10", fileFlags=2, fileMode=384) at fd.c:965
#2  0x00000000007121ea in PathNameOpenFilePerm (fileName=0x1e2c9d0 "base/13236/16759.10", fileFlags=2, fileMode=384) at fd.c:1394
#3  0x0000000000733f6e in _mdfd_openseg (reln=0x1e14d98, forknum=MAIN_FORKNUM, segno=10, oflags=0) at md.c:1783
#4  0x0000000000734026 in mdnblocks (reln=0x1e14d98, forknum=MAIN_FORKNUM) at md.c:918
#5  0x00000000006b26bc in estimate_rel_size (rel=0x7f3f707f4b98, attr_widths=0x1e2cd24, pages=<value optimized out>, tuples=0x1e2cb98, allvisfrac=<value optimized out>) at plancat.c:946
#6  0x00000000006b3a47 in get_relation_info (root=0x1bd9030, relationObjectId=16759, inhparent=false, rel=0x1e2cae0) at plancat.c:144
#7  0x00000000006b78fa in build_simple_rel (root=0x1bd9030, relid=126, parent=0x1bda578) at relnode.c:185
#8  0x00000000006b7990 in build_simple_rel (root=0x1bd9030, relid=1, parent=0x0) at relnode.c:251
#9  0x0000000000691be3 in add_base_rels_to_query (root=0x1bd9030, jtnode=<value optimized out>) at initsplan.c:121
#10 0x00000000006924af in query_planner (root=<value optimized out>, tlist=0x1d34fd8, qp_callback=0x693940 <standard_qp_callback>, qp_extra=0x7ffe0a077290) at planmain.c:138
#11 0x000000000069777e in grouping_planner (root=<value optimized out>, inheritance_update=false, tuple_fraction=<value optimized out>) at planner.c:1892
#12 0x0000000000698ef7 in subquery_planner (glob=<value optimized out>, parse=<value optimized out>, parent_root=<value optimized out>, hasRecursion=<value optimized out>, tuple_fraction=0) at planner.c:966
#13 0x0000000000699d97 in standard_planner (parse=0x1bd9778, cursorOptions=256, boundParams=<value optimized out>) at planner.c:405
#14 0x0000000000739d2a in pg_plan_query (querytree=<value optimized out>, cursorOptions=<value optimized out>, boundParams=<value optimized out>) at postgres.c:808
#15 0x00000000005a0bd6 in ExplainOneQuery (query=0x1bd9778, cursorOptions=<value optimized out>, into=0x0, es=0x1bd8f58, queryString=0x1ae0fb0 "explain SELECT COUNT(1) FROM t WHERE i=-99 LIMIT 1;", params=0x0, queryEnv=0x0) at explain.c:365
#16 0x00000000005a0e5d in ExplainQuery (pstate=0x1bd8d80, stmt=0x1ae1fa0, queryString=0x1ae0fb0 "explain SELECT COUNT(1) FROM t WHERE i=-99 LIMIT 1;", params=0x0, queryEnv=0x0, dest=0x1bd8e90) at explain.c:254
#17 0x00000000007408f2 in standard_ProcessUtility (pstmt=0x1ae2050, queryString=0x1ae0fb0 "explain SELECT COUNT(1) FROM t WHERE i=-99 LIMIT 1;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x1bd8e90, completionTag=0x7ffe0a077720 "") at utility.c:670
#18 0x000000000073c789 in PortalRunUtility (portal=0x1b46780, pstmt=0x1ae2050, isTopLevel=<value optimized out>, setHoldSnapshot=<value optimized out>, dest=0x1bd8e90, completionTag=<value optimized out>) at pquery.c:1178
#19 0x000000000073d9aa in FillPortalStore (portal=0x1b46780, isTopLevel=true) at pquery.c:1038
#20 0x000000000073de80 in PortalRun (portal=0x1b46780, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x1ae2618, altdest=0x1ae2618, completionTag=0x7ffe0a077920 "") at pquery.c:768
#21 0x000000000073a39d in exec_simple_query (query_string=0x1ae0fb0 "explain SELECT COUNT(1) FROM t WHERE i=-99 LIMIT 1;") at postgres.c:1121
#22 0x000000000073b341 in PostgresMain (argc=<value optimized out>, argv=<value optimized out>, dbname=0x1b0ac80 "postgres", username=<value optimized out>) at postgres.c:4149
#23 0x00000000006cd4f3 in BackendRun (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:4409
#24 BackendStartup (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:4081
#25 ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1754
#26 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1362
#27 0x00000000006477d0 in main (argc=7, argv=0x1adba20) at main.c:228

I haven't been following the development of the patch, so I don't know if that's
expected to be improved, or a possible future improvement, or ...

Also, I wondered if it'd be preferable for mdnblocks/estimate_rel_size to l/stat()
rather than open+lseek, which can require opening a large number of rarely-queried
files and mess up fd.c's LRU cache.

Justin
Re: open/lseek overhead with many partitions (including with "faster partitioning pruning")
From
Amit Langote
Date:
Hi Justin.

On 2018/04/04 4:46, Justin Pryzby wrote:
> TLDR: even with "faster pruning" patch, pg11dev open()+lseek() every file
> backing every table being queried, even for those partitions eventually
> "excluded".

That's expected.  The faster pruning patch doesn't change the behavior with
respect to when the partitions' files are open()'d, which at this point is still
*before* the pruning occurs.  It just switches the method of pruning to a faster
one, whereby instead of pruning each partition one-by-one using constraint
exclusion, we only look at the parent's partition descriptor to exclude
partitions that don't satisfy the query.

Thanks,
Amit
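As a rough illustration of the difference Amit describes (a toy model only, with invented structures, not PostgreSQL's PartitionBoundInfo or the patch's actual code): constraint exclusion examines every child independently, while descriptor-based pruning can locate the surviving child, or prove there is none, with a single search over the parent's ordered bound list.

    #include <stdio.h>

    /*
     * Simplified range-partition descriptor: child i covers
     * lower[i] <= key < lower[i+1]; the last child's upper bound is last_upper.
     * This is only a toy model of the parent's partition descriptor.
     */
    typedef struct PartDesc
    {
        int        nparts;
        const int *lower;       /* sorted ascending */
        int        last_upper;
    } PartDesc;

    static int upper_of(const PartDesc *pd, int i)
    {
        return (i + 1 < pd->nparts) ? pd->lower[i + 1] : pd->last_upper;
    }

    /* Constraint-exclusion style: test every child independently, O(nparts). */
    static int prune_by_constraints(const PartDesc *pd, int key)
    {
        for (int i = 0; i < pd->nparts; i++)
            if (key >= pd->lower[i] && key < upper_of(pd, i))
                return i;
        return -1;              /* all children excluded */
    }

    /* Descriptor-based style: one binary search over the bounds, O(log nparts). */
    static int prune_by_descriptor(const PartDesc *pd, int key)
    {
        int lo = 0, hi = pd->nparts - 1, found = -1;

        while (lo <= hi)
        {
            int mid = lo + (hi - lo) / 2;

            if (pd->lower[mid] <= key)
            {
                found = mid;
                lo = mid + 1;
            }
            else
                hi = mid - 1;
        }
        if (found < 0 || key >= upper_of(pd, found))
            return -1;          /* key falls outside every child */
        return found;
    }

    int main(void)
    {
        int      lower[] = {1, 2, 3, 4, 5};    /* like t1..t5 in Justin's test */
        PartDesc pd = {5, lower, 6};

        /* Both report -1 for WHERE i = -99: every partition is pruned. */
        printf("%d %d\n", prune_by_constraints(&pd, -99),
                          prune_by_descriptor(&pd, -99));
        return 0;
    }

Either way the answer for WHERE i = -99 is "no partitions survive"; the difference is how much per-child work it takes to discover that, which is separate from when each child's files get opened.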
Re: open/lseek overhead with many partitions (including with "faster partitioning pruning")
From
David Rowley
Date:
On 4 April 2018 at 07:46, Justin Pryzby <pryzby@telsasoft.com> wrote:
> TLDR: even with "faster pruning" patch, pg11dev open()+lseek() every file
> backing every table being queried, even for those partitions eventually
> "excluded".

Yeah, this will no doubt be due to the fact that we still build RelOptInfos in
the planner for all partitions, so we're still doing get_relation_info() and
estimate_rel_size() for each of those.  It might be possible to delay that call
until we know we're going to need the partition.  I imagine we wouldn't need to
know the size of the relation until after set_rel_sizes, but I've not checked.

-- 
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
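A toy sketch of the kind of deferral David is suggesting (the struct, the -1 sentinel, and the function names are invented for illustration; the real change would have to live in the planner around get_relation_info()/set_rel_sizes): if the size estimate is computed lazily, a partition that pruning discards never touches its files.

    #include <stdio.h>

    /* Toy stand-in for a per-relation planner entry; not the real RelOptInfo. */
    typedef struct RelEntry
    {
        const char *relname;
        long        pages;      /* -1 means "not estimated yet" */
    } RelEntry;

    /* Stand-in for estimate_rel_size(): the expensive open+lseek path. */
    static long expensive_size_lookup(const RelEntry *rel)
    {
        printf("opening files for %s\n", rel->relname);
        return 1000;            /* pretend result */
    }

    /* Only pay for the estimate when a surviving partition actually needs it. */
    static long rel_pages(RelEntry *rel)
    {
        if (rel->pages < 0)
            rel->pages = expensive_size_lookup(rel);
        return rel->pages;
    }

    int main(void)
    {
        RelEntry parts[] = {{"t1", -1}, {"t2", -1}, {"t3", -1}};
        int      survivor = 1;  /* index left over after pruning */

        /* Only t2's files are ever opened; t1 and t3 stay untouched. */
        printf("%s: %ld pages\n", parts[survivor].relname,
               rel_pages(&parts[survivor]));
        return 0;
    }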
Re: open/lseek overhead with many partitions (including with "faster partitioning pruning")
From
Amit Langote
Date:
On 2018/04/04 9:27, David Rowley wrote:
> On 4 April 2018 at 07:46, Justin Pryzby <pryzby@telsasoft.com> wrote:
>> TLDR: even with "faster pruning" patch, pg11dev open()+lseek() every file
>> backing every table being queried, even for those partitions eventually
>> "excluded".
>
> Yeah, this will no doubt be due to the fact that we still build RelOptInfos in
> the planner for all partitions, so we're still doing get_relation_info() and
> estimate_rel_size() for each of those.  It might be possible to delay that call
> until we know we're going to need the partition.  I imagine we wouldn't need to
> know the size of the relation until after set_rel_sizes, but I've not checked.

Yeah, one of the earliest patches on the "faster partition pruning" thread had
tried to solve this issue, but we decided it was better to come back to it after
we're done dealing with just making the pruning faster.

Thanks,
Amit
Re: open/lseek overhead with many partitions (including with "faster partitioning pruning")
From
David Rowley
Date:
On 4 April 2018 at 12:31, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/04/04 9:27, David Rowley wrote:
>> Yeah, this will no doubt be due to the fact that we still build RelOptInfos in
>> the planner for all partitions, so we're still doing get_relation_info() and
>> estimate_rel_size() for each of those.  It might be possible to delay that call
>> until we know we're going to need the partition.  I imagine we wouldn't need to
>> know the size of the relation until after set_rel_sizes, but I've not checked.
>
> Yeah, one of the earliest patches on the "faster partition pruning" thread had
> tried to solve this issue, but we decided it was better to come back to it after
> we're done dealing with just making the pruning faster.

Yeah, it'll need to be a PG12 project now.  It would be nice to not create the
RelOptInfos at all until we've decided we need them.  Perhaps we can put NULL
placeholders in simple_rel_array... the major problem with that at the moment is
that the boundinfo stuff is stored in each relation and not the parent
partitioned table, so I think pruning could only take place after each RelOptInfo
has been built.  To fix that we'd need to store the relation Oids of partitioned
tables along with their boundinfo in the parent's RelOptInfo...

Anyway, no doubt we'll get a chance to think harder on this once PG11 is out of
the way.

-- 
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
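For what it's worth, a rough picture of the data-structure shuffle David is describing, using invented stand-ins rather than the planner's real structures: the parent keeps its children's OIDs next to the bound list, pruning runs against that, and only the surviving slots of a simple_rel_array-like array are ever populated.

    #include <stdio.h>
    #include <stdlib.h>

    typedef unsigned int Oid;

    /* Toy parent entry: bounds and child OIDs live together, as David suggests. */
    typedef struct ParentInfo
    {
        int        nparts;
        const int *lower;       /* sorted lower bounds, one per child */
        const Oid *child_oids;  /* which relation each bound belongs to */
    } ParentInfo;

    /* Toy stand-in for a child RelOptInfo, built only when needed. */
    typedef struct ChildRel
    {
        Oid  oid;
        long pages;
    } ChildRel;

    static ChildRel *build_child_rel(Oid oid)
    {
        ChildRel *rel = malloc(sizeof(ChildRel));

        rel->oid = oid;
        rel->pages = 1000;      /* where open()+lseek() (or stat()) would happen */
        return rel;
    }

    int main(void)
    {
        int        lower[] = {1, 2, 3};
        Oid        oids[]  = {16759, 16760, 16761};
        ParentInfo parent  = {3, lower, oids};
        int        key     = 2;

        /* simple_rel_array-like array: NULL placeholders until a child survives. */
        ChildRel *rels[3] = {NULL, NULL, NULL};

        /* Prune against the parent's bound list *before* building any child rel. */
        for (int i = 0; i < parent.nparts; i++)
        {
            int hi = (i + 1 < parent.nparts) ? parent.lower[i + 1]
                                             : parent.lower[i] + 1;

            if (key >= parent.lower[i] && key < hi)
                rels[i] = build_child_rel(parent.child_oids[i]);
        }

        for (int i = 0; i < 3; i++)
            printf("slot %d: %s\n", i, rels[i] ? "built" : "NULL placeholder");
        return 0;
    }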