Thread: BUG #18909: Query creates millions of temporary files and stalls
The following bug has been logged on the website:

Bug reference:      18909
Logged by:          Sergey Koposov
Email address:      skoposov@ed.ac.uk
PostgreSQL version: 15.12
Operating system:   Linux, Debian 10
Description:

Hi,

I have a query:

select phot_g_mean_mag,phot_bp_mean_mag,phot_rp_mean_mag,
       g.source_id,pmra,pmdec,parallax,parallax_error,
       d.ra,d.dec,d.ebv,d.gmeanpsfmag,rmeanpsfmag,imeanpsfmag,gkronmag,rkronmag
from disk_sample1 as d
left join gaia_edr3_aux.panstarrs1bestneighbour as gaps1
       on (gaps1.original_ext_source_id = d.objid)
left join gaia_edr3.gaia_source as g
       on (g.source_id = gaps1.source_id)

that I ran (from a python client). After it didn't finish in about an hour, I started to
investigate and found that the pgsql_tmp folder contained 15 million temporary files from
that query.

If I try to explain it in psql, it has this plan:

 Gather  (cost=14443102.62..111487693.90 rows=65470868 width=100)
   Workers Planned: 4
   ->  Nested Loop Left Join  (cost=14442102.62..104939607.10 rows=16367717 width=100)
         ->  Parallel Hash Left Join  (cost=14442102.04..22124798.60 rows=16367717 width=60)
               Hash Cond: (d.objid = gaps1.original_ext_source_id)
               ->  Parallel Seq Scan on disk_sample1 d  (cost=0.00..1218371.17 rows=16367717 width=60)
               ->  Parallel Hash  (cost=10307380.24..10307380.24 rows=237862624 width=16)
                     ->  Parallel Seq Scan on panstarrs1bestneighbour gaps1  (cost=0.00..10307380.24 rows=237862624 width=16)
         ->  Index Scan using gaia_sourcex_source_id_idx on gaia_source g  (cost=0.58..5.05 rows=1 width=48)
               Index Cond: (source_id = gaps1.source_id)
 JIT:
   Functions: 14
   Options: Inlining true, Optimization true, Expressions true, Deforming true
(13 rows)

And if I just do a CREATE TABLE ... AS with the same query:

create table disk_sample1_gaia as
select phot_g_mean_mag,phot_bp_mean_mag,phot_rp_mean_mag,
       g.source_id,pmra,pmdec,parallax,parallax_error,
       d.ra,d.dec,d.ebv,d.gmeanpsfmag,rmeanpsfmag,imeanpsfmag,gkronmag,rkronmag
from disk_sample1 as d
left join gaia_edr3_aux.panstarrs1bestneighbour as gaps1
       on (gaps1.original_ext_source_id = d.objid)
left join gaia_edr3.gaia_source as g
       on (g.source_id = gaps1.source_id);

that command finished fine, without creating many temporary files. The resulting table
has 65 million rows.

However, when I ran the query from python again, I very quickly saw postgresql creating
300k+ files, at which point I interrupted the query.
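As an aside (not something taken from the report itself): the temp-file churn can also be
watched from SQL instead of listing pgsql_tmp on the filesystem. A minimal sketch,
assuming PostgreSQL 12 or later for pg_ls_tmpdir():

-- files currently sitting in the default tablespace's pgsql_tmp directory
select count(*) as tmp_files, pg_size_pretty(sum(size)) as tmp_size
from pg_ls_tmpdir();

-- cumulative temp-file counters for this database since the last stats reset
select temp_files, pg_size_pretty(temp_bytes) as temp_bytes
from pg_stat_database
where datname = current_database();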
The way the query is run from python is using https://github.com/segasai/sqlutilpy which
basically does this (using psycopg):
1) creates named cursor
2) set cursor_tuple_fraction TO 1
3) FETCH the result in 100,000 row chunks
(a SQL-level sketch of this sequence follows below)

Here's the gdb trace of postgres from the situation when the pg tries to create millions
of tmp files:
**********
#0  0x00007fcd4ce6cc8b in __libc_open64 (file=file@entry=0x7ffcce70e3e0 "base/pgsql_tmp/pgsql_tmp78857.224669", oflag=oflag@entry=578) at ../sysdeps/unix/sysv/linux/open64.c:48
#1  0x000056430b609eac in BasicOpenFilePerm (fileName=0x7ffcce70e3e0 "base/pgsql_tmp/pgsql_tmp78857.224669", fileFlags=578, fileMode=384) at fd.c:1125
#2  0x000056430b60a2f3 in PathNameOpenFilePerm (fileName=0x7ffcce70e3e0 "base/pgsql_tmp/pgsql_tmp78857.224669", fileFlags=578, fileMode=384) at fd.c:1603
#3  0x000056430b60a4fa in PathNameOpenFile (fileFlags=578, fileName=0x7ffcce70e3e0 "base/pgsql_tmp/pgsql_tmp78857.224669") at fd.c:1806
#4  OpenTemporaryFileInTablespace (tblspcOid=<optimized out>, rejectError=<optimized out>) at fd.c:1806
#5  0x000056430b60a6d9 in OpenTemporaryFile (interXact=interXact@entry=false) at fd.c:1743
#6  0x000056430b6082c4 in BufFileCreateTemp (interXact=interXact@entry=false) at buffile.c:204
#7  0x000056430b4ccaeb in ExecHashJoinSaveTuple (tuple=tuple@entry=0x56430c243cd8, hashvalue=<optimized out>, hashvalue@entry=1412196535, fileptr=0x7fc500720d98) at nodeHashjoin.c:1248
#8  0x000056430b4c9b3b in ExecHashTableInsert (hashtable=hashtable@entry=0x56430c2434a8, slot=slot@entry=0x56430c208178, hashvalue=hashvalue@entry=1412196535) at nodeHash.c:1714
#9  0x000056430b4ca2fc in MultiExecPrivateHash (node=<optimized out>) at nodeHash.c:188
#10 MultiExecHash (node=node@entry=0x56430c10e098) at nodeHash.c:115
#11 0x000056430b4b213e in MultiExecProcNode (node=node@entry=0x56430c10e098) at execProcnode.c:520
#12 0x000056430b4cb5dc in ExecHashJoinImpl (parallel=false, pstate=0x56430c10ca68) at nodeHashjoin.c:297
#13 ExecHashJoin (pstate=0x56430c10ca68) at nodeHashjoin.c:621
#14 0x000056430b4aa92a in ExecProcNode (node=0x56430c10ca68) at ../../../src/include/executor/executor.h:262
#15 ExecutePlan (queryDesc=queryDesc@entry=0x56430c121928, operation=operation@entry=CMD_SELECT, sendTuples=sendTuples@entry=true, numberTuples=numberTuples@entry=100000, direction=direction@entry=ForwardScanDirection, dest=dest@entry=0x56430bfb70b8) at execMain.c:1640
#16 0x000056430b4ab03b in standard_ExecutorRun (queryDesc=0x56430c121928, direction=ForwardScanDirection, count=100000, execute_once=<optimized out>) at execMain.c:362
#17 0x00007fcd4d2556d5 in explain_ExecutorRun (queryDesc=0x56430c121928, direction=ForwardScanDirection, count=100000, execute_once=<optimized out>) at auto_explain.c:322
#18 0x000056430b641989 in PortalRunSelect (portal=0x56430c062378, forward=<optimized out>, count=100000, dest=<optimized out>) at pquery.c:922
#19 0x000056430b643356 in PortalRunFetch (portal=0x56430c062378, fdirection=FETCH_FORWARD, count=100000, dest=dest@entry=0x56430bfb70b8) at pquery.c:1425
#20 0x000056430b45ad19 in PerformPortalFetch (stmt=stmt@entry=0x56430bf5e710, dest=dest@entry=0x56430bfb70b8, qc=qc@entry=0x7ffcce70ee00) at portalcmds.c:198
#21 0x000056430b644462 in standard_ProcessUtility (pstmt=0x56430bf5ea30, queryString=0x56430bf5dd08 "FETCH FORWARD 100000 FROM \"sqlutilcursor\"", readOnlyTree=<optimized out>, context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x56430bfb70b8, qc=0x7ffcce70ee00) at utility.c:710
#22 0x000056430b641e90 in PortalRunUtility
    (portal=portal@entry=0x56430c062268, pstmt=0x56430bf5ea30, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=true, dest=dest@entry=0x56430bfb70b8, qc=qc@entry=0x7ffcce70ee00) at pquery.c:1156
#23 0x000056430b642517 in FillPortalStore (portal=portal@entry=0x56430c062268, isTopLevel=isTopLevel@entry=true) at ../../../src/include/nodes/pg_list.h:260
#24 0x000056430b64318d in PortalRun (portal=portal@entry=0x56430c062268, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x56430bf5eb10, altdest=altdest@entry=0x56430bf5eb10, qc=0x7ffcce70eff0) at pquery.c:761
#25 0x000056430b63ee91 in exec_simple_query (query_string=0x56430bf5dd08 "FETCH FORWARD 100000 FROM \"sqlutilcursor\"") at postgres.c:1254
#26 0x000056430b63fbd5 in PostgresMain (dbname=<optimized out>, username=<optimized out>) at postgres.c:4691
#27 0x000056430b5ac569 in BackendRun (port=<optimized out>, port=<optimized out>) at postmaster.c:4515
#28 BackendStartup (port=<optimized out>) at postmaster.c:4243
#29 ServerLoop () at postmaster.c:1811
#30 0x000056430b5ad627 in PostmasterMain (argc=3, argv=0x56430bf574f0) at postmaster.c:1483
#31 0x000056430b2f2d84 in main (argc=3, argv=0x56430bf574f0) at main.c:204
*************

Interestingly, when I tried to run the same query with JIT disabled (i.e. set
jit_above_cost = -1 before the query), it finished fine without creating millions of
tmp files.

Also, I tried to run the query just in psql. (I didn't let it finish, since it's 60+
million rows), but I think it works fine -- I don't see an explosion in the number of
tmp files. So I think the fact that a named cursor was used is important.

I'm happy to provide more info if I can / if needed.

Thanks,
Sergey
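For reference, the client-side sequence described above amounts to roughly the following
at the SQL level (a minimal sketch: the cursor name and the 100,000-row chunk size are
taken from the FETCH statement visible in the backtrace; exactly how psycopg wraps the
transaction and declares the cursor is assumed, not verified here):

BEGIN;
SET cursor_tuple_fraction TO 1;
DECLARE "sqlutilcursor" CURSOR FOR
    SELECT ... ;   -- the three-way left-join query from the report
FETCH FORWARD 100000 FROM "sqlutilcursor";
FETCH FORWARD 100000 FROM "sqlutilcursor";
-- ... repeated until the result is exhausted ...
CLOSE "sqlutilcursor";
COMMIT;

With cursor_tuple_fraction = 1 the planner optimizes the cursor for retrieving the whole
result, which is why the cursor plan can differ from the plain-SELECT plan shown later in
the thread.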
PG Bug reporting form <noreply@postgresql.org> writes:
> I have a query: ...
> that I ran (from python client), and after it didn't finish in ~ an hour, i
> started to investigate, finding that the pgsql_tmp folder had 15 million
> temporary files from that query.

Does it behave any better if you disable parallelism
(set max_parallel_workers_per_gather = 0) ?

			regards, tom lane
On Fri, 2025-05-02 at 18:12 -0400, Tom Lane wrote:
> PG Bug reporting form <noreply@postgresql.org> writes:
> > I have a query: ...
> > that I ran (from python client), and after it didn't finish in ~ an hour, i
> > started to investigate, finding that the pgsql_tmp folder had 15 million
> > temporary files from that query.
>
> Does it behave any better if you disable parallelism
> (set max_parallel_workers_per_gather = 0) ?

Yes, I've just tried and that also prevents the issue.

S
If helpful here's the bt full during the problem (pg was compiled with -O0 -g3 -ggdb) #0 0x00007f229bd96c8b in __libc_open64 (file=0x7ffffb4c9b80 "base/pgsql_tmp/pgsql_tmp35816.189548", oflag=578) at ../sysdeps/unix/sysv/linux/open64.c:48 resultvar = 495 sc_ret = <optimized out> mode = 384 #1 0x00005615d86aef93 in BasicOpenFilePerm (fileName=0x7ffffb4c9b80 "base/pgsql_tmp/pgsql_tmp35816.189548", fileFlags=578, fileMode=384) at fd.c:1125 fd = 22037 __func__ = "BasicOpenFilePerm" #2 0x00005615d86af8d6 in PathNameOpenFilePerm (fileName=0x7ffffb4c9b80 "base/pgsql_tmp/pgsql_tmp35816.189548", fileFlags=578, fileMode=384) at fd.c:1603 fnamecopy = 0x56173e322160 "base/pgsql_tmp/pgsql_tmp35816.189548" file = 190724 vfdP = 0x7f1a5539a8f0 __func__ = "PathNameOpenFilePerm" #3 0x00005615d86af822 in PathNameOpenFile (fileName=0x7ffffb4c9b80 "base/pgsql_tmp/pgsql_tmp35816.189548", fileFlags=578) at fd.c:1568 No locals. #4 0x00005615d86afcea in OpenTemporaryFileInTablespace (tblspcOid=1663, rejectError=true) at fd.c:1806 tempdirpath = "base/pgsql_tmp\000\000\070\231L\373\377\177\000\000\240\037\000\000\377\377", '\000' <repeats 89 times>,"\200\377?", '\000' <repeats 13 times>, "\200\206\373p\330\025V\000\000\000\000\000\000\000\000\000\000\f\000\000\000\000\000\000\000X\240L\373\377\177\000\000@\300b\332\025V\000\000РL\373\377\177\000\ 000\305\003q\330\025V\000\000\253t\334<\000\000\000\000G"... tempfilepath = "base/pgsql_tmp/pgsql_tmp35816.189548\000V\000\000\022\000\000\000\000\000\000\000\220\235L\373\377\177\000\000\340l\322g\032\177\000\000\002", '\000'<repeats 63 times>, "t\234L\373\377\177\000\000x\234L\373\377\177\000\000_\244?$<\327\002\000\205\303\r\004<\327\002\000\000\000\000\000\000\000\000\000\332\340\061 \000\000\000\000\260\234L\373\377\177\000\000\060\227\212\330\025V\000\000\200\234L\373\377\177\000\000"... file = 22037 __func__ = "OpenTemporaryFileInTablespace" #5 0x00005615d86afb6e in OpenTemporaryFile (interXact=false) at fd.c:1743 file = 0 #6 0x00005615d86ac995 in BufFileCreateTemp (interXact=false) at buffile.c:204 file = 0x100000000000003 pfile = -1879610502 #7 0x00005615d84fb31c in ExecHashJoinSaveTuple (tuple=0x5615da85f5e8, hashvalue=2415356794, fileptr=0x7f1a4f685a80) at nodeHashjoin.c:1248 file = 0x0 #8 0x00005615d84f6a59 in ExecHashTableInsert (hashtable=0x5615da85e5c0, slot=0x5615da823378, hashvalue=2415356794) at nodeHash.c:1714 shouldFree = true tuple = 0x5615da85f5e8 bucketno = 32992122 batchno = 3521863 #9 0x00005615d84f3fb1 in MultiExecPrivateHash (node=0x5615da6c2f18) at nodeHash.c:188 bucketNumber = -1 outerNode = 0x5615da6c2638 hashkeys = 0x5615da85ba48 hashtable = 0x5615da85e5c0 slot = 0x5615da823378 econtext = 0x5615da7a9c28 hashvalue = 2415356794 #10 0x00005615d84f3e63 in MultiExecHash (node=0x5615da6c2f18) at nodeHash.c:115 No locals. #11 0x00005615d84d532b in MultiExecProcNode (node=0x5615da6c2f18) at execProcnode.c:520 result = 0x1 __func__ = "MultiExecProcNode" #12 0x00005615d84f9ca8 in ExecHashJoinImpl (pstate=0x5615da6c18e8, parallel=false) at nodeHashjoin.c:297 node = 0x5615da6c18e8 outerNode = 0x5615da6c1bd8 hashNode = 0x5615da6c2f18 joinqual = 0x0 otherqual = 0x0 econtext = 0x5615da6c1af8 hashtable = 0x5615da85e5c0 outerTupleSlot = 0x0 hashvalue = 0 batchno = 0 parallel_state = 0x0 __func__ = "ExecHashJoinImpl" #13 0x00005615d84fa350 in ExecHashJoin (pstate=0x5615da6c18e8) at nodeHashjoin.c:621 No locals. #14 0x00005615d84d5241 in ExecProcNodeFirst (node=0x5615da6c18e8) at execProcnode.c:464 No locals. 
#15 0x00005615d84c9dc3 in ExecProcNode (node=0x5615da6c18e8) at ../../../src/include/executor/executor.h:262 No locals. #16 0x00005615d84cc539 in ExecutePlan (queryDesc=0x5615da6d6cc8, operation=CMD_SELECT, sendTuples=true, numberTuples=100000, direction=ForwardScanDirection, dest=0x5615da56e0d8) at execMain.c:1640 estate = 0x5615da6c1688 planstate = 0x5615da6c18e8 use_parallel_mode = false slot = 0x5615da56e0d8 current_tuple_count = 0 #17 0x00005615d84ca298 in standard_ExecutorRun (queryDesc=0x5615da6d6cc8, direction=ForwardScanDirection, count=100000, execute_once=false) at execMain.c:362 estate = 0x5615da6c1688 operation = CMD_SELECT dest = 0x5615da56e0d8 sendTuples = true oldcontext = 0x5615da6c3580 #18 0x00007f229c17f6d5 in explain_ExecutorRun (queryDesc=0x5615da6d6cc8, direction=ForwardScanDirection, count=100000, execute_once=<optimized out>) at auto_explain.c:322 _save_exception_stack = 0x7ffffb4ca900 _save_context_stack = 0x0 _local_sigjmp_buf = {{ __jmpbuf = {1, -1720014361324494981, 94651820578848, 140737409496096, 0, 0, -1720014361087516805, -1613547901269909637}, __mask_was_saved = 0, __saved_mask = { __val = {0, 94651857416032, 0, 0, 0, 0, 0, 140737409492960, 94651827305323, 24, 94651857416032, 0, 94651857416456, 140737409493008, 94651827395998, 4294967296} } }} _do_rethrow = false #19 0x00005615d84ca18b in ExecutorRun (queryDesc=0x5615da6d6cc8, direction=ForwardScanDirection, count=100000, execute_once=false) at execMain.c:309 No locals. #20 0x00005615d86ed0a7 in PortalRunSelect (portal=0x5615da616338, forward=true, count=100000, dest=0x5615da56e0d8) at pquery.c:922 queryDesc = 0x5615da6d6cc8 direction = ForwardScanDirection nprocessed = 0 __func__ = "PortalRunSelect" #21 0x00005615d86ee021 in DoPortalRunFetch (portal=0x5615da616338, fdirection=FETCH_FORWARD, count=100000, dest=0x5615da56e0d8) at pquery.c:1674 forward = true __func__ = "DoPortalRunFetch" #22 0x00005615d86eda7f in PortalRunFetch (portal=0x5615da616338, fdirection=FETCH_FORWARD, count=100000, dest=0x5615da56e0d8) at pquery.c:1425 _save_exception_stack = 0x7ffffb4caca0 _save_context_stack = 0x0 _local_sigjmp_buf = {{ __jmpbuf = {1, -1720014361253191813, 94651820578848, 140737409496096, 0, 0, -1720014361318203525, -4902996909213425797}, __mask_was_saved = 0, __saved_mask = { __val = {2786264749430538240, 94651857065712, 94651858125368, 140737409493456, 64, 94651827132028, 94651858128616, 7, 0, 7, 94651858125512, 94651858128672, 94651858131912, 140737409493472, 94651827135315,0} } }} _do_rethrow = false result = 94651858117432 saveActivePortal = 0x5615da616228 saveResourceOwner = 0x5615da571c20 savePortalContext = 0x5615da56df90 oldContext = 0x5615da56df90 __func__ = "PortalRunFetch" #23 0x00005615d84553a9 in PerformPortalFetch (stmt=0x5615da515710, dest=0x5615da56e0d8, qc=0x7ffffb4cac30) at portalcmds.c:198 portal = 0x5615da616338 nprocessed = 140737409493584 __func__ = "PerformPortalFetch" #24 0x00005615d86eec72 in standard_ProcessUtility (pstmt=0x5615da515a30, queryString=0x5615da514d08 "FETCH FORWARD 100000 FROM \"sqlutilcursor\"", readOnlyTree=false, context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x5615da56e0d8, qc=0x7ffffb4cac30) at utility.c:710 parsetree = 0x5615da515710 isTopLevel = true isAtomicContext = true pstate = 0x5615da56e168 readonly_flags = 5 __func__ = "standard_ProcessUtility" #25 0x00005615d86ee768 in ProcessUtility (pstmt=0x5615da515a30, queryString=0x5615da514d08 "FETCH FORWARD 100000 FROM \"sqlutilcursor\"", readOnlyTree=false, context=PROCESS_UTILITY_TOPLEVEL, 
params=0x0, queryEnv=0x0, dest=0x5615da56e0d8, qc=0x7ffffb4cac30) at utility.c:530 No locals. #26 0x00005615d86ed654 in PortalRunUtility (portal=0x5615da616228, pstmt=0x5615da515a30, isTopLevel=true, setHoldSnapshot=true, dest=0x5615da56e0d8, qc=0x7ffffb4cac30) at pquery.c:1156 No locals. #27 0x00005615d86ed3b7 in FillPortalStore (portal=0x5615da616228, isTopLevel=true) at pquery.c:1029 treceiver = 0x5615da56e0d8 qc = { commandTag = CMDTAG_UNKNOWN, nprocessed = 0 } __func__ = "FillPortalStore" #28 0x00005615d86ecd74 in PortalRun (portal=0x5615da616228, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x5615da515b10, altdest=0x5615da515b10, qc=0x7ffffb4cae00) at pquery.c:761 _save_exception_stack = 0x7ffffb4caf20 _save_context_stack = 0x0 _local_sigjmp_buf = {{ __jmpbuf = {0, -1720014361414672517, 94651820578848, 140737409496096, 0, 0, -1720014361467101317, -4902996908842098821}, __mask_was_saved = 0, __saved_mask = { __val = {20843699160, 94651857066752, 128, 94651857062896, 94651857428368, 94651857065712, 4336, 94651857428632, 94651857063120, 140737409494400, 94651827307209, 7, 112, 94651857066880, 112, 4216106384} } }} _do_rethrow = false result = false nprocessed = 94651857066768 saveTopTransactionResourceOwner = 0x5615da5774a8 saveTopTransactionContext = 0x5615da56af60 saveActivePortal = 0x0 saveResourceOwner = 0x5615da5774a8 savePortalContext = 0x0 saveMemoryContext = 0x5615da56af60 __func__ = "PortalRun" #29 0x00005615d86e6808 in exec_simple_query (query_string=0x5615da514d08 "FETCH FORWARD 100000 FROM \"sqlutilcursor\"") at postgres.c:1254 snapshot_set = false per_parsetree_context = 0x0 plantree_list = 0x5615da515ac0 parsetree = 0x5615da515740 commandTag = CMDTAG_FETCH qc = { commandTag = CMDTAG_UNKNOWN, nprocessed = 0 } querytree_list = 0x5615da5159e0 portal = 0x5615da616228 receiver = 0x5615da515b10 format = 0 parsetree_item__state = { l = 0x5615da515770, i = 0 } dest = DestRemote oldcontext = 0x5615da56af60 parsetree_list = 0x5615da515770 parsetree_item = 0x5615da515788 save_log_statement_stats = false was_logged = false use_implicit_block = false msec_str = "\bMQ\332\025V\000\000)\000\000\000\006\000\000\000p\256L\373\377\177\000\000\333\070~\000)\000\000" __func__ = "exec_simple_query" #30 0x00005615d86eb14c in PostgresMain (dbname=0x5615da5703e8 "wsdb", username=0x5615da510588 "koposov") at postgres.c:4691 query_string = 0x5615da514d08 "FETCH FORWARD 100000 FROM \"sqlutilcursor\"" firstchar = 81 input_message = { data = 0x5615da514d08 "FETCH FORWARD 100000 FROM \"sqlutilcursor\"", len = 42, maxlen = 1024, cursor = 42 } local_sigjmp_buf = {{ __jmpbuf = {0, -1720014361601319045, 94651820578848, 140737409496096, 0, 0, -1720014361379020933, -4902996912041042053}, __mask_was_saved = 1, __saved_mask = { __val = {4194304, 18446744073709551536, 0, 0, 139786617312675, 0, 0, 140737409495072, 94651820578848, 140737409496096, 94651827270967, 129792, 94651857036688, 8248, 42949672960, 94651857748440} } }} send_ready_for_query = false idle_in_transaction_timeout_enabled = false idle_session_timeout_enabled = false __func__ = "PostgresMain" #31 0x00005615d8638aa1 in BackendRun (port=0x5615da56a760) at postmaster.c:4515 No locals. 
#32 0x00005615d86383ed in BackendStartup (port=0x5615da56a760) at postmaster.c:4243 bn = 0x5615da567980 pid = 0 __func__ = "BackendStartup" #33 0x00005615d8634a2f in ServerLoop () at postmaster.c:1811 port = 0x5615da56a760 i = 0 rmask = { fds_bits = {64, 0 <repeats 15 times>} } selres = 1 now = 1746287454 readmask = { fds_bits = {4032, 0 <repeats 15 times>} } nSockets = 12 last_lockfile_recheck_time = 1746287454 last_touch_time = 1746287454 __func__ = "ServerLoop" #34 0x00005615d8634288 in PostmasterMain (argc=3, argv=0x5615da50e4f0) at postmaster.c:1483 opt = -1 status = 0 userDoption = 0x5615da531d40 "/mnt/bigdata/pgdata15" listen_addr_saved = true i = 64 output_config_variable = 0x0 __func__ = "PostmasterMain" #35 0x00005615d8544c28 in main (argc=3, argv=0x5615da50e4f0) at main.c:204 do_check_root = true quit Please answer y or n. Detaching from program: /opt/pgsql15/bin/postgres, process 35816 [Inferior 1 (process 35816) detached] S The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. Is e buidheanncarthannais a th’ ann an Oilthigh Dhùn Èideann, clàraichte an Alba, àireamh clàraidh SC005336.
Sergey Koposov <Sergey.Koposov@ed.ac.uk> writes:
> #8  0x00005615d84f6a59 in ExecHashTableInsert (hashtable=0x5615da85e5c0, slot=0x5615da823378, hashvalue=2415356794)
>     at nodeHash.c:1714
>         shouldFree = true
>         tuple = 0x5615da85f5e8
>         bucketno = 32992122
>         batchno = 3521863

Yeah, this confirms the idea that the hashtable has exploded into an
unreasonable number of buckets and batches.  I don't know why a
parallel hash join would be more prone to do that than a non-parallel
one, though.  I'm hoping some of the folks who worked on PHJ will
look at this.

What have you got work_mem set to?  I hope it's fairly large, if
you need to join such large tables.

			regards, tom lane
On Sat, 2025-05-03 at 12:27 -0400, Tom Lane wrote:
> Sergey Koposov <Sergey.Koposov@ed.ac.uk> writes:
> > #8  0x00005615d84f6a59 in ExecHashTableInsert (hashtable=0x5615da85e5c0, slot=0x5615da823378, hashvalue=2415356794)
> >     at nodeHash.c:1714
> >         bucketno = 32992122
> >         batchno = 3521863
>
> Yeah, this confirms the idea that the hashtable has exploded into an
> unreasonable number of buckets and batches.  I don't know why a
> parallel hash join would be more prone to do that than a non-parallel
> one, though.  I'm hoping some of the folks who worked on PHJ will
> look at this.

Thanks

> What have you got work_mem set to?  I hope it's fairly large, if
> you need to join such large tables.

Here're my memory settings

shared_buffers = 32GB
work_mem = 1GB

S
On 3/5/2025 18:52, Sergey Koposov wrote:
> On Sat, 2025-05-03 at 12:27 -0400, Tom Lane wrote:
>> Yeah, this confirms the idea that the hashtable has exploded into an
>> unreasonable number of buckets and batches.  I don't know why a
>> parallel hash join would be more prone to do that than a non-parallel
>> one, though.  I'm hoping some of the folks who worked on PHJ will
>> look at this.
> Here're my memory settings
>
> shared_buffers = 32GB
> work_mem = 1GB

Could you show statistics from the pg_statistic table for two columns:
"d.objid" and "gaps1.original_ext_source_id"? At least stanullfrac,
stadistinct, stawidth and the MCV frequencies.

Also, an EXPLAIN ANALYZE of this query, which successfully finished
execution without parallel workers, might provide quick insights.

--
regards, Andrei Lepikhov
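For reference, one way to pull exactly those numbers is through the pg_stats view (which
presents pg_statistic in readable form); a minimal sketch, assuming the tables live in
the schemas that appear elsewhere in this thread:

SELECT schemaname, tablename, attname,
       null_frac, n_distinct, avg_width,
       most_common_vals, most_common_freqs
FROM pg_stats
WHERE (schemaname, tablename, attname) IN
      (('koposov',       'disk_sample1',            'objid'),
       ('gaia_edr3_aux', 'panstarrs1bestneighbour', 'original_ext_source_id'));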
On Sat, 2025-05-03 at 19:56 +0200, Andrei Lepikhov wrote:
> Could you show statistics from the pg_statistic table for two columns:
> "d.objid" and "gaps1.original_ext_source_id"? At least stanullfrac,
> stadistinct, stawidth and the MCV frequencies.
>
> Also, an EXPLAIN ANALYZE of this query, which successfully finished
> execution without parallel workers, might provide quick insights.

For some reason, when running it in psql with

wsdb=> set max_parallel_workers_per_gather = 0;
SET
wsdb=> explain analyze select phot_g_mean_mag,phot_bp_mean_mag,phot_rp_mean_mag,
g.source_id,pmra,pmdec,parallax,parallax_error,d.ra,d.dec,d.ebv,d.gmeanpsfmag,rmeanpsfmag,imeanpsfmag,gkronmag,rkronmag
from disk_sample1 as d
left join gaia_edr3_aux.panstarrs1bestneighbour as gaps1 on (gaps1.original_ext_source_id = d.objid)
left join gaia_edr3.gaia_source as g on ( g.source_id = gaps1.source_id ) ;

it also triggered the issue, so it didn't finish. (I don't quite know why, when I did the
same thing earlier from python, it worked fine -- unless I made a mistake somehow, or the
issue depends on which parts of the table are in the buffers.)

And here are the statistics:

table_name | column_name | stanullfrac | stadistinct | stawidth | stakind1 | stavalues1 | stakind2 | stavalues2 | stakind3 | stavalues3 | stakind4 | stavalues4 | stakind5 | stavalues5
----------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------- -----------------------------------------------------------------------------------------------+----------+------------+----------+------------+----------+------ ------+----------+------------ gaia_edr3_aux.panstarrs1bestneighbour | original_ext_source_id | 0 | -1 | 8 | 2 | {65821194190923049,69822775194516201,71021357366914725,71622558219298246,72242570679316167,72752715133276582,73292555331458103,73722640104185638,7431271120960725 3,74872844134480919,75461207930286818,75932867164620064,76482380938670301,77062576694809476,77652902237497049,78262718311176992,78892794541132028,795515044225764 94,80112867400946845,80772802603258902,81391421588335444,81971412486350795,82631186975787792,83252787960950269,83972820164120115,84752904153083118,85512619880528 302,86202647938176936,86842598063633314,87602612520706466,88421075961167122,89172651209793104,89952620931199224,90722572791560233,91650973331643328,9261269053096 6902,93602857051441783,94782658946712918,95712666148671194,96952828393314703,97932742026524718,98862853134788904,99802966167329820,100902836681868414,10190281236 5694460,103031292324379499,104431512205175644,105751238115030863,106953048414271619,108123145169985425,109550825872818887,111053588196313311,112282896513802828,1 13570942665251677,115022939740948640,116382768852523070,117662904386441014,118932892636732902,120190477050925736,121420948001414218,122442964708121838,1238132188 80751231,124992869043914348,126310843575958592,127562955440231806,128872831401617836,130233203538192704,131521007626061188,132613587058728897,134002801572489756, 135283031853672449,136872614522998101,138313019539925635,140112936758215218,141542930246066222,142983169221289589,144482954390781393,146082964742924866,147612878 246882308,149172975717758062,150602803315151189,152220866064652863,154243076414126838,156152890996068633,158242935213542742,160152946363967092,162193040985483908 ,164030230662105236,166071022961061724,167933557037565204,169750502017714204,171783528800592714,173633125479747431,175841940675015146,178023541392200467,18008076 9816969530,182390090370071085,185103234223553859,189060291309395827,194410132668322056,213381935620486100} | 3 | | 0 | | 0 | | 0 | koposov.disk_sample1 | objid | 0 | -1 | 8 | 2 | {80171356131152995,85481367757687896,88191325798310474,90321328433852673,92321466354661847,94201275367376529,95701503912853131,97161484423571822,9853144852720894 4,99921427833159308,101151276638551505,102331296875522814,103501223994628897,104551265261567639,105761212996488657,106981229573480759,108151464449806724,10952135 9876084906,110721535047532362,112031293425751247,112991512837345375,114041353644327525,115181393912959844,116271438699167868,117361157875578842,11846119946421047 5,119651383957806910,120691403987052979,121971437975375438,122991360553450551,124081345165940966,125561127251039774,126951413062522443,128261239226290575,1294911 91145884395,130701198797726914,131881390246784832,133071309574028386,134441245837162400,135561325470268624,136831168062713109,138031312799349329,1392611655099586 
83,140541093855172591,141721073205266546,142881401626551876,144371304279438040,145581082546437325,147031098350718441,148451143937693348,149711230639891464,151101 129058019814,152461282899556445,153601374120407807,154871110028410102,156201081651853270,157471169532236664,158711380905398833,160070980700916736,161581282957436 146,163001203841617369,164491232859199607,165710958822329339,167161342475229839,168720942762011833,170101108127616821,171491208376278158,173061183798016157,17436 1022423370078,175860958433513392,177241069877220468,178661417914144713,180101182484225880,181550985775752317,183000983447406349,184530819926849727,18598116819371 2346,187450900510134237,188911315124166293,190320741585574970,191851126057417156,193291748947587292,194551520693837798,195430932247712984,196451520400227653,1976 70906570737306,198601716301982705,199510491344039825,200363325766562309,201183158865495520,201921012051408117,202533343632192769,203143424629645681,2039131989607 26463,204762033978144872,205573304352815754,206650175134906301,207582854953296137,208920557060736946,210870029621871940,215863110451068650} | 3 | | 0 | | 0 | | 0 | (2 rows) Basically most of ids should be distinct but I would expect most objects form disk_sample1 have a 1 to 1 match in the gaps1table Sergey The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. Is e buidheanncarthannais a th’ ann an Oilthigh Dhùn Èideann, clàraichte an Alba, àireamh clàraidh SC005336.
And this is the explain analyze run that finishes in psql, with the default settings
(i.e. with parallelism):

                                                       QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=14443102.62..111487693.90 rows=65470868 width=100) (actual time=212912.377..334892.676 rows=65514296 loops=1)
   Workers Planned: 4
   Workers Launched: 4
   ->  Nested Loop Left Join  (cost=14442102.62..104939607.10 rows=16367717 width=100) (actual time=209806.019..317684.146 rows=13102859 loops=5)
         ->  Parallel Hash Left Join  (cost=14442102.04..22124798.60 rows=16367717 width=60) (actual time=209805.943..273008.489 rows=13102859 loops=5)
               Hash Cond: (d.objid = gaps1.original_ext_source_id)
               ->  Parallel Seq Scan on disk_sample1 d  (cost=0.00..1218371.17 rows=16367717 width=60) (actual time=37.353..25185.340 rows=13095751 loops=5)
               ->  Parallel Hash  (cost=10307380.24..10307380.24 rows=237862624 width=16) (actual time=169633.067..169633.068 rows=190290095 loops=5)
                     Buckets: 67108864  Batches: 32  Memory Usage: 1919904kB
                     ->  Parallel Seq Scan on panstarrs1bestneighbour gaps1  (cost=0.00..10307380.24 rows=237862624 width=16) (actual time=132.295..117548.803 rows=190290095 loops=5)
         ->  Index Scan using gaia_sourcex_source_id_idx on gaia_source g  (cost=0.58..5.05 rows=1 width=48) (actual time=0.003..0.003 rows=0 loops=65514296)
               Index Cond: (source_id = gaps1.source_id)
 Planning Time: 1.266 ms
 JIT:
   Functions: 75
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 11.796 ms, Inlining 293.374 ms, Optimization 81.354 ms, Emission 173.338 ms, Total 559.861 ms
 Execution Time: 337814.695 ms
(18 rows)

And I did verify again that when I query through python (i.e. using the cursor) with
max_parallel_workers_per_gather=0 it finishes fine.

Also, I clearly see that when I query through the cursor in python with the default
settings (which is when I see the issue), it uses a different plan than when just running
the query in psql, because when running in psql I see these kinds of files in tmp:

pgsql_tmp75270.0.fileset

as opposed to:

pgsql_tmp73459.0
...

I don't think I know how to see the plan of the declare cursor query.

So summarizing, the query produces millions of files in:

1) query through cursor with default settings (max_parallel_workers_per_gather=4)
2) query through psql with no parallelism (max_parallel_workers_per_gather=0)

It works in:

3) query through cursor with no parallelism (max_parallel_workers_per_gather=0)
4) query through psql with default settings (max_parallel_workers_per_gather=4)

I hope it makes sense to someone & helps.

Sergey
Sergey Koposov <Sergey.Koposov@ed.ac.uk> writes:
>          ->  Parallel Hash Left Join  (cost=14442102.04..22124798.60 rows=16367717 width=60) (actual time=209805.943..273008.489 rows=13102859 loops=5)
>                Hash Cond: (d.objid = gaps1.original_ext_source_id)
>                ->  Parallel Seq Scan on disk_sample1 d  (cost=0.00..1218371.17 rows=16367717 width=60) (actual time=37.353..25185.340 rows=13095751 loops=5)
>                ->  Parallel Hash  (cost=10307380.24..10307380.24 rows=237862624 width=16) (actual time=169633.067..169633.068 rows=190290095 loops=5)
>                      Buckets: 67108864  Batches: 32  Memory Usage: 1919904kB
>                      ->  Parallel Seq Scan on panstarrs1bestneighbour gaps1  (cost=0.00..10307380.24 rows=237862624 width=16) (actual time=132.295..117548.803 rows=190290095 loops=5)

Hm, interesting.  The number of batches stayed sane here (32), whereas
it went crazy in the other run.  I wonder if there's something
nondeterministic about that choice in a parallel hash join.

> I don't think I know how to see the plan of the declare cursor query.

EXPLAIN DECLARE c CURSOR FOR SELECT ...

			regards, tom lane
On Sat, 2025-05-03 at 18:03 -0400, Tom Lane wrote:
> Hm, interesting.  The number of batches stayed sane here (32), whereas
> it went crazy in the other run.  I wonder if there's something
> nondeterministic about that choice in a parallel hash join.
>
> > I don't think I know how to see the plan of the declare cursor query.
>
> EXPLAIN DECLARE c CURSOR FOR SELECT ...

Thanks! I tried something like that before, but without the cursor name.

Here's the plan for the case that originally triggered it. Note that I did not disable
parallelism here by hand.

wsdb=> set cursor_tuple_fraction TO 1;
SET
wsdb=> EXPLAIN DECLARE c CURSOR FOR select phot_g_mean_mag,phot_bp_mean_mag,phot_rp_mean_mag,
g.source_id,pmra,pmdec,parallax,parallax_error,d.ra,d.dec,d.ebv,d.gmeanpsfmag,rmeanpsfmag,imeanpsfmag,gkronmag,rkronmag
from disk_sample1 as d
left join gaia_edr3_aux.panstarrs1bestneighbour as gaps1 on (gaps1.original_ext_source_id = d.objid)
left join gaia_edr3.gaia_source as g on ( g.source_id = gaps1.source_id ) ;

                                                   QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Hash Right Join  (cost=112529534.06..422118114.74 rows=65470868 width=100)
   Hash Cond: (g.source_id = gaps1.source_id)
   ->  Seq Scan on gaia_source g  (cost=0.00..124645680.12 rows=1811786112 width=48)
   ->  Hash  (cost=111007847.21..111007847.21 rows=65470868 width=60)
         ->  Hash Right Join  (cost=3231089.53..111007847.21 rows=65470868 width=60)
               Hash Cond: (gaps1.original_ext_source_id = d.objid)
               ->  Seq Scan on panstarrs1bestneighbour gaps1  (cost=0.00..17443258.96 rows=951450496 width=16)
               ->  Hash  (cost=1709402.68..1709402.68 rows=65470868 width=60)
                     ->  Seq Scan on disk_sample1 d  (cost=0.00..1709402.68 rows=65470868 width=60)
 JIT:
   Functions: 18
   Options: Inlining true, Optimization true, Expressions true, Deforming true
(12 rows)

S
Sergey Koposov <Sergey.Koposov@ed.ac.uk> writes:
> wsdb=> set cursor_tuple_fraction TO 1;
> SET
> wsdb=> EXPLAIN DECLARE c CURSOR FOR select phot_g_mean_mag,phot_bp_mean_mag,phot_rp_mean_mag,
> g.source_id,pmra,pmdec,parallax,parallax_error,d.ra,d.dec,d.ebv,d.gmeanpsfmag,rmeanpsfmag,imeanpsfmag,gkronmag,rkronmag
> from disk_sample1 as d
> left join gaia_edr3_aux.panstarrs1bestneighbour as gaps1 on (gaps1.original_ext_source_id = d.objid)
> left join gaia_edr3.gaia_source as g on ( g.source_id = gaps1.source_id ) ;
>                                                    QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------
>  Hash Right Join  (cost=112529534.06..422118114.74 rows=65470868 width=100)
>    Hash Cond: (g.source_id = gaps1.source_id)
>    ->  Seq Scan on gaia_source g  (cost=0.00..124645680.12 rows=1811786112 width=48)
>    ->  Hash  (cost=111007847.21..111007847.21 rows=65470868 width=60)
>          ->  Hash Right Join  (cost=3231089.53..111007847.21 rows=65470868 width=60)
>                Hash Cond: (gaps1.original_ext_source_id = d.objid)
>                ->  Seq Scan on panstarrs1bestneighbour gaps1  (cost=0.00..17443258.96 rows=951450496 width=16)
>                ->  Hash  (cost=1709402.68..1709402.68 rows=65470868 width=60)
>                      ->  Seq Scan on disk_sample1 d  (cost=0.00..1709402.68 rows=65470868 width=60)

Oh, *that* is interesting: now we have two hash joins, and neither one
has panstarrs1bestneighbour as the table to hash, so it's not too
clear which one is going crazy.  But you showed the stats for
disk_sample1.objid, and that looked pretty well distributed, so I'm
going to guess that that hash is fine.  That leaves the other join
on panstarrs1bestneighbour.source_id as the one under suspicion.
Can we see the stats for that column?

			regards, tom lane
On Sat, 2025-05-03 at 18:47 -0400, Tom Lane wrote:
> Oh, *that* is interesting: now we have two hash joins, and neither one
> has panstarrs1bestneighbour as the table to hash, so it's not too
> clear which one is going crazy.  But you showed the stats for
> disk_sample1.objid, and that looked pretty well distributed, so I'm
> going to guess that that hash is fine.  That leaves the other join
> on panstarrs1bestneighbour.source_id as the one under suspicion.
> Can we see the stats for that column?

I don't quite know what you are looking for here, but below are the stats for
panstarrs1bestneighbour.source_id. Those are object ids for ~ a billion objects, so I
basically expect most of them to be distinct, which matches my expectation.
table_name | column_name | stanullfrac | stadistinct | stawidth | stakind1 | stavalues1 | stakind2 | stavalues2 | stakind3 | stavalues3 | stakind4 | stavalues4 | stakind5 | stavalues5 ---------------------------------------+-------------+-------------+-------------+----------+----------+--------------------------------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------- ---------------------+----------+------------+----------+------------+----------+------------+----------+------------ gaia_edr3_aux.panstarrs1bestneighbour | source_id | 0 | -1 | 8 | 2 | {806904097912320,181280403379521536,231030517101341952,279340244824304640,395434550873163776,430064940901112960,465200384719646464,511243945963340288,55485098513 0936448,884099024809719424,1088756926588540928,1368729867335745024,1743677218036649344,1806876076961984000,1822446535998227072,1827513777076336640,18434769335570 31936,1864935449158255104,1920817367429277568,1968055999039378816,1987254777719980160,2004190452249389056,2020732153709040512,2027928972718120832,203348532469370 5856,2045418530529014272,2059732346543415936,2074956120121436416,2100442043737863296,2163620222393133568,2179506511844570496,2206141459472625664,2252050369194606 336,2591598075455078272,2906007067548492032,2949458190829542272,3032992517301187328,3068094632176092416,3120910944525529984,3215924142328292992,33363926191583658 24,3378822326196932864,3442282147328280064,3663373148030165504,4046707917046071296,4050407357085387776,4052784814808650880,4058733516320497280,406154888016680512 
0,4064082807792646528,4067157871310616320,4070490250503721728,4075824943460652416,4079541052298279552,4089451977102617600,4091947563572467328,4096492394857354624 ,4101247370255732864,4104701520432256512,4108024244563444608,4111583191923415552,4116149772934806784,4118390921289286016,4121139459806669056,4124851857730469120, 4135313745180553728,4145067478499619328,4150596918151346688,4157604522387932416,4169873659086912896,4186656398413699456,4202234455318397824,4208158108512840448,4 237268606566890112,4252491997962068864,4256170723281997056,4262931890855102848,4268558164899557376,4283472009313681920,4291562353308821376,4298242676736372736,43 08513249218792320,4314497062051819904,4320919893590099456,4369138185748815360,4459756193193297920,4481744016273417216,4507165210444471936,4516588295676538240,453 3771875708241024,4602125356804206464,5598986647730081280,5639341335818720640,5699272347395052928,5980496601182037504,6031590184872417536,6046668097063710464,6251 266251555253888,6755724660278190464,6794820098303412096,6917184811783233280} | 3 | | 0 | | 0 | | 0 | (1 row) The only thing I can add is that panstarrs1bestneighbour is ordered on disk by source_id. I don't know if that can be relevant. S The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. Is e buidheanncarthannais a th’ ann an Oilthigh Dhùn Èideann, clàraichte an Alba, àireamh clàraidh SC005336.
On 4/5/2025 01:05, Sergey Koposov wrote:
> The only thing I can add is that panstarrs1bestneighbour is ordered on disk by source_id. I don't know if that can be relevant.

Seems promising. Could you show us how many NULLs the underlying JOIN operator
generates? I guess, in your example, the query should be close to the following:

SELECT count(*) FROM disk_sample1 d
LEFT JOIN gaia_edr3_aux.panstarrs1bestneighbour AS gaps1
  ON (gaps1.original_ext_source_id = d.objid)
WHERE gaps1.original_ext_source_id IS NULL;

And just to be sure, show us the number of NULLs that may be generated
by the alternative LEFT JOIN:

SELECT count(*) FROM gaia_edr3_aux.panstarrs1bestneighbour AS gaps1
LEFT JOIN gaia_edr3.gaia_source AS g
  ON (g.source_id = gaps1.source_id)
WHERE g.source_id IS NULL;

Also, please show the current value of the GUC hash_mem_multiplier.

--
regards, Andrei Lepikhov
On Sun, 2025-05-04 at 09:52 +0200, Andrei Lepikhov wrote:
> Seems promising. Could you show us how many NULLs the underlying JOIN operator
> generates? I guess, in your example, the query should be close to the following:
>
> SELECT count(*) FROM disk_sample1 d
> LEFT JOIN gaia_edr3_aux.panstarrs1bestneighbour AS gaps1
>   ON (gaps1.original_ext_source_id = d.objid)
> WHERE gaps1.original_ext_source_id IS NULL;
>
> And just to be sure, show us the number of NULLs that may be generated
> by the alternative LEFT JOIN:
>
> SELECT count(*) FROM gaia_edr3_aux.panstarrs1bestneighbour AS gaps1
> LEFT JOIN gaia_edr3.gaia_source AS g
>   ON (g.source_id = gaps1.source_id)
> WHERE g.source_id IS NULL;
>
> Also, please show the current value of the GUC hash_mem_multiplier.

Here:

wsdb=> SELECT count(*) FROM disk_sample1 d
LEFT JOIN gaia_edr3_aux.panstarrs1bestneighbour AS gaps1
ON (gaps1.original_ext_source_id = d.objid)
WHERE gaps1.original_ext_source_id IS NULL;
  count
----------
 37635427
(1 row)

wsdb=> SELECT count(*) FROM gaia_edr3_aux.panstarrs1bestneighbour AS gaps1
LEFT JOIN gaia_edr3.gaia_source AS g
ON (g.source_id = gaps1.source_id)
WHERE g.source_id IS NULL;
 count
-------
     0
(1 row)

wsdb=> show hash_mem_multiplier ;
 hash_mem_multiplier
---------------------
 2
(1 row)

Sergey
On 4/5/2025 13:45, Sergey Koposov wrote:
> wsdb=> SELECT count(*) FROM disk_sample1 d
> LEFT JOIN gaia_edr3_aux.panstarrs1bestneighbour AS gaps1
> ON (gaps1.original_ext_source_id = d.objid)
> WHERE gaps1.original_ext_source_id IS NULL;
>   count
> ----------
>  37635427
> (1 row)
>
> wsdb=> show hash_mem_multiplier ;
>  hash_mem_multiplier
> ---------------------
>  2
> (1 row)

Hmm. That means the bucket to store the tuples with NULL values should be at least
2.103GB in size (and I think it will take even more). So, my main conjecture is an
extreme data skew issue: we have told Postgres to use less memory than a single
non-divisible batch needs.

To solve it, we have at least two ongoing projects in this area. Project [1] is dedicated
to resolving this issue using 'slices', by matching the too-massive batch to the
corresponding batch on the other side of the join in small portions. Project [2] should
allow the optimiser to take into account NULLs generated by the nullable side of the
join, which will give it a better chance of avoiding a skewed HashJoin.

Neither of these projects is guaranteed to be committed even in PG 19. What's more,
there is little chance they will be back-patched into v15.

So, if your problem really is too many NULLs, I would say you need a workaround:
increase hash_mem_multiplier, if you are sure you can put an upper bound on the number
of such NULLs in this query, or turn off hash join for this query altogether (both are
sketched below).

As for fixing the bug quickly (I have at least one other report of this issue right now,
too), I think we may invent two or three steps of (local) hash_mem_multiplier
escalation for the case where a massive batch that won't split any further is detected.
And it definitely makes sense to me to add an error message that tells people what to do
if they run into this problem.

[1] https://www.postgresql.org/message-id/flat/7bed6c08-72a0-4ab9-a79c-e01fcdd0940f%40vondra.me
[2] https://www.postgresql.org/message-id/flat/52b55b53-420a-722c-adbd-706922fc059b%40enterprisedb.com

--
regards, Andrei Lepikhov
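A minimal sketch of the two session-level workarounds suggested above (the multiplier
value is only illustrative; whether it is large enough for the ~2GB-plus batch of NULL
join keys in this particular case is an assumption, not something verified in the
thread):

-- Option 1: give each hash table more memory (roughly work_mem * hash_mem_multiplier)
SET hash_mem_multiplier = 8;   -- illustrative value, not a tested recommendation

-- Option 2: keep the planner from choosing a hash join for this query at all
SET enable_hashjoin = off;

-- run the problematic query (or the DECLARE ... / FETCH ... sequence) here,
-- then restore the defaults:
RESET hash_mem_multiplier;
RESET enable_hashjoin;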
Thanks. I don't have a strong opinion on how this gets fixed, but I personally think this
is a bug that needs addressing in the back branches. I think I remember hitting OOMs in
similar situations with hash joins in the past, and I just disabled them. But I'd
certainly argue the situation here is more severe, with 12 million temp files (possibly
more).

Even cancelling the query doesn't work well, because the deletion of the temp files
during cancellation is not fast enough. Only restarting the server deletes the tmp files
quickly enough (and that still takes a few minutes). And apparently even ls can't quite
deal with 12 million files in a folder.

Sergey