Thread: open/lseek overhead with many partitions (including with "faster partitioning pruning")
open/lseek overhead with many partitions (including with "faster partitioning pruning")
From
Justin Pryzby
Date:
TLDR: even with "faster pruning" patch, pg11dev open()+lseek() every file
backing every table being queried, even for those partitions eventually
"excluded".

One of our customers (the one with the largest number of child tables, currently
of relkind='r') takes a long time to plan queries, and I wondered whether we'd
benefit from the ongoing work, whether I should plan to convert to relkind='p',
and whether I could start testing that for our use cases.

I found by stracing that the backend is open()ing + lseek(SEEK_END)ing each file
in each relation being UNIONed in this view, including relations which were
excluded by constraints.  Their tables are partitioned by day (some are still
40+GB), so there are only 10-20 tables being included in the above query, total
size perhaps ~50GB and under 50 backing files, but it's accessing a LOT more
files than needed:

cold backend:
[pryzbyj@database ~]$ grep open /tmp/strace-pg12-0 |cut -d/ -f1,2 |sort |uniq -c |sort -nr |head
  26593 open("pg_tblspc/16400
   6894 open("pg_tblspc/16401
    482 open("base/16402

warm backend:
[pryzbyj@database ~]$ grep open /tmp/strace-pg12-1 |cut -d/ -f1,2 |sort |uniq -c |sort -nr |head
   6545 open("pg_tblspc/16401
   1338 open("pg_tblspc/16400
    254 open("base/16402

I was curious if that was improved in pg11dev, and if it applied to relkind='r'
or only to relkind='p'.  I applied the v47 patches for "path toward faster
partition pruning", and was able to quickly create a synthetic "huge table"
using truncate (and verified that the smgr logic actually opens all 1e5 files).

$ psql -d postgres -h /tmp --port 5678 -c "CREATE TABLE t(i int) PARTITION BY range(i)"
$ for a in `seq 1 999`; do psql -d postgres -h /tmp --port 5678 -c "CREATE TABLE t$a PARTITION OF t FOR VALUES FROM ($a) TO ($((1+a)))"; done
$ time for a in ./test11dev/base/13236/1???[0-9]; do echo "$a..."; truncate -s 1G $a; done  # make the existing, empty files 1GB
$ time for a in ./test11dev/base/13236/1???[0-9]; do echo "$a..."; for b in `seq 1 99`; do truncate -s 1G $a.$b; done; done  # make new 1GB segment files with .n extension

postgres=# explain SELECT COUNT(1) FROM t WHERE i=-99 LIMIT 1;
 Limit  (cost=0.00..0.01 rows=1 width=8)
   ->  Aggregate  (cost=0.00..0.01 rows=1 width=8)
         ->  Result  (cost=0.00..0.00 rows=0 width=0)
               One-Time Filter: false
Time: 63268.474 ms (01:03.268)

=> zfs/fuse is certainly exacerbating the issue here due to context switches; but..
[pryzbyj@database postgresql]$ time cut -d\( -f1 /tmp/strace-pg11dev-faster-15 |sort |uniq -c |sort -nr |head
 100899 open
  99900 lseek
  98918 close
    118 brk

postgres is still doing open()+lseek() on 100k files, most of which are for
relations which were clearly excluded:

(gdb) bt
#0  0x00000032b2adb580 in __open_nocancel () from /lib64/libc.so.6
#1  0x0000000000711827 in BasicOpenFilePerm (fileName=0x1e2c9d0 "base/13236/16759.10", fileFlags=2, fileMode=384) at fd.c:965
#2  0x00000000007121ea in PathNameOpenFilePerm (fileName=0x1e2c9d0 "base/13236/16759.10", fileFlags=2, fileMode=384) at fd.c:1394
#3  0x0000000000733f6e in _mdfd_openseg (reln=0x1e14d98, forknum=MAIN_FORKNUM, segno=10, oflags=0) at md.c:1783
#4  0x0000000000734026 in mdnblocks (reln=0x1e14d98, forknum=MAIN_FORKNUM) at md.c:918
#5  0x00000000006b26bc in estimate_rel_size (rel=0x7f3f707f4b98, attr_widths=0x1e2cd24, pages=<value optimized out>, tuples=0x1e2cb98, allvisfrac=<value optimized out>) at plancat.c:946
#6  0x00000000006b3a47 in get_relation_info (root=0x1bd9030, relationObjectId=16759, inhparent=false, rel=0x1e2cae0) at plancat.c:144
#7  0x00000000006b78fa in build_simple_rel (root=0x1bd9030, relid=126, parent=0x1bda578) at relnode.c:185
#8  0x00000000006b7990 in build_simple_rel (root=0x1bd9030, relid=1, parent=0x0) at relnode.c:251
#9  0x0000000000691be3 in add_base_rels_to_query (root=0x1bd9030, jtnode=<value optimized out>) at initsplan.c:121
#10 0x00000000006924af in query_planner (root=<value optimized out>, tlist=0x1d34fd8, qp_callback=0x693940 <standard_qp_callback>, qp_extra=0x7ffe0a077290) at planmain.c:138
#11 0x000000000069777e in grouping_planner (root=<value optimized out>, inheritance_update=false, tuple_fraction=<value optimized out>) at planner.c:1892
#12 0x0000000000698ef7 in subquery_planner (glob=<value optimized out>, parse=<value optimized out>, parent_root=<value optimized out>, hasRecursion=<value optimized out>, tuple_fraction=0) at planner.c:966
#13 0x0000000000699d97 in standard_planner (parse=0x1bd9778, cursorOptions=256, boundParams=<value optimized out>) at planner.c:405
#14 0x0000000000739d2a in pg_plan_query (querytree=<value optimized out>, cursorOptions=<value optimized out>, boundParams=<value optimized out>) at postgres.c:808
#15 0x00000000005a0bd6 in ExplainOneQuery (query=0x1bd9778, cursorOptions=<value optimized out>, into=0x0, es=0x1bd8f58, queryString=0x1ae0fb0 "explain SELECT COUNT(1) FROM t WHERE i=-99 LIMIT 1;", params=0x0, queryEnv=0x0) at explain.c:365
#16 0x00000000005a0e5d in ExplainQuery (pstate=0x1bd8d80, stmt=0x1ae1fa0, queryString=0x1ae0fb0 "explain SELECT COUNT(1) FROM t WHERE i=-99 LIMIT 1;", params=0x0, queryEnv=0x0, dest=0x1bd8e90) at explain.c:254
#17 0x00000000007408f2 in standard_ProcessUtility (pstmt=0x1ae2050, queryString=0x1ae0fb0 "explain SELECT COUNT(1) FROM t WHERE i=-99 LIMIT 1;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x1bd8e90, completionTag=0x7ffe0a077720 "") at utility.c:670
#18 0x000000000073c789 in PortalRunUtility (portal=0x1b46780, pstmt=0x1ae2050, isTopLevel=<value optimized out>, setHoldSnapshot=<value optimized out>, dest=0x1bd8e90, completionTag=<value optimized out>) at pquery.c:1178
#19 0x000000000073d9aa in FillPortalStore (portal=0x1b46780, isTopLevel=true) at pquery.c:1038
#20 0x000000000073de80 in PortalRun (portal=0x1b46780, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x1ae2618, altdest=0x1ae2618, completionTag=0x7ffe0a077920 "") at pquery.c:768
#21 0x000000000073a39d in exec_simple_query (query_string=0x1ae0fb0 "explain SELECT COUNT(1) FROM t WHERE i=-99 LIMIT 1;") at postgres.c:1121
#22 0x000000000073b341 in PostgresMain (argc=<value optimized out>, argv=<value optimized out>, dbname=0x1b0ac80 "postgres", username=<value optimized out>) at postgres.c:4149
#23 0x00000000006cd4f3 in BackendRun (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:4409
#24 BackendStartup (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:4081
#25 ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1754
#26 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1362
#27 0x00000000006477d0 in main (argc=7, argv=0x1adba20) at main.c:228

I haven't been following the development of the patch, so I don't know if that's
expected to be improved, or a possible future improvement, or ...

Also, I wondered if it'd be preferable for mdnblocks/estimate_rel_size to l/stat()
rather than open+lseek, which can require opening a large number of rarely-queried
files and mess up fd.c's LRU cache.

Justin
Re: open/lseek overhead with many partitions (including with "faster partitioning pruning")
From
Amit Langote
Date:
Hi Justin.

On 2018/04/04 4:46, Justin Pryzby wrote:
> TLDR: even with "faster pruning" patch, pg11dev open()+lseek() every file
> backing every table being queried, even for those partitions eventually
> "excluded".

That's expected.  The faster pruning patch doesn't change the behavior with
respect to when the partitions' files are open()'d, which at this point is still
*before* the pruning occurs.  It just switches the method of pruning to a faster
one, whereby instead of pruning each partition one-by-one using constraint
exclusion, we only look at the parent's partition descriptor to exclude
partitions that don't satisfy the query.

Thanks,
Amit
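As a rough illustration of the difference Amit describes (a toy model only, with invented structures, not PostgreSQL's PartitionBoundInfo or the patch's actual code): constraint exclusion examines every child independently, while descriptor-based pruning can locate the surviving child, or prove there is none, with a single search over the parent's ordered bound list.

    #include <stdio.h>

    /*
     * Simplified range-partition descriptor: child i covers
     * lower[i] <= key < lower[i+1]; the last child's upper bound is last_upper.
     * This is only a toy model of the parent's partition descriptor.
     */
    typedef struct PartDesc
    {
        int        nparts;
        const int *lower;       /* sorted ascending */
        int        last_upper;
    } PartDesc;

    static int upper_of(const PartDesc *pd, int i)
    {
        return (i + 1 < pd->nparts) ? pd->lower[i + 1] : pd->last_upper;
    }

    /* Constraint-exclusion style: test every child independently, O(nparts). */
    static int prune_by_constraints(const PartDesc *pd, int key)
    {
        for (int i = 0; i < pd->nparts; i++)
            if (key >= pd->lower[i] && key < upper_of(pd, i))
                return i;
        return -1;              /* all children excluded */
    }

    /* Descriptor-based style: one binary search over the bounds, O(log nparts). */
    static int prune_by_descriptor(const PartDesc *pd, int key)
    {
        int lo = 0, hi = pd->nparts - 1, found = -1;

        while (lo <= hi)
        {
            int mid = lo + (hi - lo) / 2;

            if (pd->lower[mid] <= key)
            {
                found = mid;
                lo = mid + 1;
            }
            else
                hi = mid - 1;
        }
        if (found < 0 || key >= upper_of(pd, found))
            return -1;          /* key falls outside every child */
        return found;
    }

    int main(void)
    {
        int      lower[] = {1, 2, 3, 4, 5};    /* like t1..t5 in Justin's test */
        PartDesc pd = {5, lower, 6};

        /* Both report -1 for WHERE i = -99: every partition is pruned. */
        printf("%d %d\n", prune_by_constraints(&pd, -99),
                          prune_by_descriptor(&pd, -99));
        return 0;
    }

Either way the answer for WHERE i = -99 is "no partitions survive"; the difference is how much per-child work it takes to discover that, which is separate from when each child's files get opened.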
Re: open/lseek overhead with many partitions (including with "faster partitioning pruning")
From
David Rowley
Date:
On 4 April 2018 at 07:46, Justin Pryzby <pryzby@telsasoft.com> wrote:
> TLDR: even with "faster pruning" patch, pg11dev open()+lseek() every file
> backing every table being queried, even for those partitions eventually
> "excluded".

Yeah, this will no doubt be due to the fact that we still build RelOptInfos in
the planner for all partitions, so we're still doing get_relation_info() and
estimate_rel_size() for each of those.  It might be possible to delay that call
until we know we're going to need the partition.  I imagine we wouldn't need to
know the size of the relation until after set_rel_sizes, but I've not checked.

-- 
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
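A toy sketch of the kind of deferral David is suggesting (the struct, the -1 sentinel, and the function names are invented for illustration; the real change would have to live in the planner around get_relation_info()/set_rel_sizes): if the size estimate is computed lazily, a partition that pruning discards never touches its files.

    #include <stdio.h>

    /* Toy stand-in for a per-relation planner entry; not the real RelOptInfo. */
    typedef struct RelEntry
    {
        const char *relname;
        long        pages;      /* -1 means "not estimated yet" */
    } RelEntry;

    /* Stand-in for estimate_rel_size(): the expensive open+lseek path. */
    static long expensive_size_lookup(const RelEntry *rel)
    {
        printf("opening files for %s\n", rel->relname);
        return 1000;            /* pretend result */
    }

    /* Only pay for the estimate when a surviving partition actually needs it. */
    static long rel_pages(RelEntry *rel)
    {
        if (rel->pages < 0)
            rel->pages = expensive_size_lookup(rel);
        return rel->pages;
    }

    int main(void)
    {
        RelEntry parts[] = {{"t1", -1}, {"t2", -1}, {"t3", -1}};
        int      survivor = 1;  /* index left over after pruning */

        /* Only t2's files are ever opened; t1 and t3 stay untouched. */
        printf("%s: %ld pages\n", parts[survivor].relname,
               rel_pages(&parts[survivor]));
        return 0;
    }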
Re: open/lseek overhead with many partitions (including with "faster partitioning pruning")
From
Amit Langote
Date:
On 2018/04/04 9:27, David Rowley wrote:
> On 4 April 2018 at 07:46, Justin Pryzby <pryzby@telsasoft.com> wrote:
>> TLDR: even with "faster pruning" patch, pg11dev open()+lseek() every file
>> backing every table being queried, even for those partitions eventually
>> "excluded".
>
> Yeah, this will no doubt be due to the fact that we still build RelOptInfos in
> the planner for all partitions, so we're still doing get_relation_info() and
> estimate_rel_size() for each of those.  It might be possible to delay that call
> until we know we're going to need the partition.  I imagine we wouldn't need to
> know the size of the relation until after set_rel_sizes, but I've not checked.

Yeah, one of the earliest patches on the "faster partition pruning" thread had
tried to solve this issue, but we decided it was better to come back to it after
we're done dealing with just making the pruning faster.

Thanks,
Amit
Re: open/lseek overhead with many partitions (including with "faster partitioning pruning")
From
David Rowley
Date:
On 4 April 2018 at 12:31, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/04/04 9:27, David Rowley wrote:
>> Yeah, this will no doubt be due to the fact that we still build RelOptInfos in
>> the planner for all partitions, so we're still doing get_relation_info() and
>> estimate_rel_size() for each of those.  It might be possible to delay that call
>> until we know we're going to need the partition.  I imagine we wouldn't need to
>> know the size of the relation until after set_rel_sizes, but I've not checked.
>
> Yeah, one of the earliest patches on the "faster partition pruning" thread had
> tried to solve this issue, but we decided it was better to come back to it after
> we're done dealing with just making the pruning faster.

Yeah, it'll need to be a PG12 project now.  It would be nice to not create the
RelOptInfos at all until we've decided we need them.  Perhaps we can put NULL
placeholders in simple_rel_array... the major problem with that at the moment is
that the boundinfo stuff is stored in each relation and not the parent
partitioned table, so I think pruning could only take place after each RelOptInfo
has been built.  To fix that we'd need to store the relation Oids of partitioned
tables along with their boundinfo in the parent's RelOptInfo...

Anyway, no doubt we'll get a chance to think harder on this once PG11 is out of
the way.

-- 
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
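For what it's worth, a rough picture of the data-structure shuffle David is describing, using invented stand-ins rather than the planner's real structures: the parent keeps its children's OIDs next to the bound list, pruning runs against that, and only the surviving slots of a simple_rel_array-like array are ever populated.

    #include <stdio.h>
    #include <stdlib.h>

    typedef unsigned int Oid;

    /* Toy parent entry: bounds and child OIDs live together, as David suggests. */
    typedef struct ParentInfo
    {
        int        nparts;
        const int *lower;       /* sorted lower bounds, one per child */
        const Oid *child_oids;  /* which relation each bound belongs to */
    } ParentInfo;

    /* Toy stand-in for a child RelOptInfo, built only when needed. */
    typedef struct ChildRel
    {
        Oid  oid;
        long pages;
    } ChildRel;

    static ChildRel *build_child_rel(Oid oid)
    {
        ChildRel *rel = malloc(sizeof(ChildRel));

        rel->oid = oid;
        rel->pages = 1000;      /* where open()+lseek() (or stat()) would happen */
        return rel;
    }

    int main(void)
    {
        int        lower[] = {1, 2, 3};
        Oid        oids[]  = {16759, 16760, 16761};
        ParentInfo parent  = {3, lower, oids};
        int        key     = 2;

        /* simple_rel_array-like array: NULL placeholders until a child survives. */
        ChildRel *rels[3] = {NULL, NULL, NULL};

        /* Prune against the parent's bound list *before* building any child rel. */
        for (int i = 0; i < parent.nparts; i++)
        {
            int hi = (i + 1 < parent.nparts) ? parent.lower[i + 1]
                                             : parent.lower[i] + 1;

            if (key >= parent.lower[i] && key < hi)
                rels[i] = build_child_rel(parent.child_oids[i]);
        }

        for (int i = 0; i < 3; i++)
            printf("slot %d: %s\n", i, rels[i] ? "built" : "NULL placeholder");
        return 0;
    }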