Thread: BUG #18909: Query creates millions of temporary files and stalls

BUG #18909: Query creates millions of temporary files and stalls

From: PG Bug reporting form
The following bug has been logged on the website:

Bug reference:      18909
Logged by:          Sergey Koposov
Email address:      skoposov@ed.ac.uk
PostgreSQL version: 15.12
Operating system:   Linux, Debian 10
Description:

Hi,
I have a query:
select phot_g_mean_mag, phot_bp_mean_mag, phot_rp_mean_mag,
       g.source_id, pmra, pmdec, parallax, parallax_error,
       d.ra, d.dec, d.ebv, d.gmeanpsfmag, rmeanpsfmag, imeanpsfmag, gkronmag, rkronmag
from disk_sample1 as d
left join gaia_edr3_aux.panstarrs1bestneighbour as gaps1
       on (gaps1.original_ext_source_id = d.objid)
left join gaia_edr3.gaia_source as g
       on (g.source_id = gaps1.source_id)
that I ran (from a Python client), and after it didn't finish in about an hour, I
started to investigate and found that the pgsql_tmp folder had 15 million
temporary files from that query.
If I try to EXPLAIN it in psql, it produces this plan:

------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=14443102.62..111487693.90 rows=65470868 width=100)
   Workers Planned: 4
   ->  Nested Loop Left Join  (cost=14442102.62..104939607.10 rows=16367717 width=100)
         ->  Parallel Hash Left Join  (cost=14442102.04..22124798.60 rows=16367717 width=60)
               Hash Cond: (d.objid = gaps1.original_ext_source_id)
               ->  Parallel Seq Scan on disk_sample1 d  (cost=0.00..1218371.17 rows=16367717 width=60)
               ->  Parallel Hash  (cost=10307380.24..10307380.24 rows=237862624 width=16)
                     ->  Parallel Seq Scan on panstarrs1bestneighbour gaps1  (cost=0.00..10307380.24 rows=237862624 width=16)
         ->  Index Scan using gaia_sourcex_source_id_idx on gaia_source g  (cost=0.58..5.05 rows=1 width=48)
               Index Cond: (source_id = gaps1.source_id)
 JIT:
   Functions: 14
   Options: Inlining true, Optimization true, Expressions true, Deforming true
(13 rows)
And if I just do a CREATE TABLE:

create table disk_sample1_gaia as
    select phot_g_mean_mag, phot_bp_mean_mag, phot_rp_mean_mag,
           g.source_id, pmra, pmdec, parallax, parallax_error,
           d.ra, d.dec, d.ebv, d.gmeanpsfmag, rmeanpsfmag, imeanpsfmag, gkronmag, rkronmag
    from disk_sample1 as d
    left join gaia_edr3_aux.panstarrs1bestneighbour as gaps1
           on (gaps1.original_ext_source_id = d.objid)
    left join gaia_edr3.gaia_source as g
           on (g.source_id = gaps1.source_id);

That command finished fine without creating many temporary files. The
resulting table has 65 million rows.
However, when I ran the query from Python again, I very quickly saw PostgreSQL
creating 300k+ files, at which point I interrupted the query.
The way the query is run from Python is using
https://github.com/segasai/sqlutilpy
which basically does this (using psycopg; roughly the SQL sketched below):
1) create a named cursor
2) SET cursor_tuple_fraction TO 1
3) FETCH the result in 100,000-row chunks
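In SQL terms that is roughly equivalent to the following sketch (the cursor name "sqlutilcursor" is the one that shows up in the backtrace below; the exact statements sqlutilpy issues may differ slightly):

BEGIN;
SET cursor_tuple_fraction TO 1;
DECLARE sqlutilcursor CURSOR FOR
    select phot_g_mean_mag, phot_bp_mean_mag, phot_rp_mean_mag,
           g.source_id, pmra, pmdec, parallax, parallax_error,
           d.ra, d.dec, d.ebv, d.gmeanpsfmag, rmeanpsfmag, imeanpsfmag, gkronmag, rkronmag
    from disk_sample1 as d
    left join gaia_edr3_aux.panstarrs1bestneighbour as gaps1
           on (gaps1.original_ext_source_id = d.objid)
    left join gaia_edr3.gaia_source as g
           on (g.source_id = gaps1.source_id);
FETCH FORWARD 100000 FROM sqlutilcursor;   -- repeated until no more rows come back
CLOSE sqlutilcursor;
COMMIT;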
Here's the gdb trace of the postgres backend from the situation where it is trying to
create millions of tmp files:
**********
#0  0x00007fcd4ce6cc8b in __libc_open64 (file=file@entry=0x7ffcce70e3e0 "base/pgsql_tmp/pgsql_tmp78857.224669", oflag=oflag@entry=578) at ../sysdeps/unix/sysv/linux/open64.c:48
#1  0x000056430b609eac in BasicOpenFilePerm (fileName=0x7ffcce70e3e0 "base/pgsql_tmp/pgsql_tmp78857.224669", fileFlags=578, fileMode=384) at fd.c:1125
#2  0x000056430b60a2f3 in PathNameOpenFilePerm (fileName=0x7ffcce70e3e0 "base/pgsql_tmp/pgsql_tmp78857.224669", fileFlags=578, fileMode=384) at fd.c:1603
#3  0x000056430b60a4fa in PathNameOpenFile (fileFlags=578, fileName=0x7ffcce70e3e0 "base/pgsql_tmp/pgsql_tmp78857.224669") at fd.c:1806
#4  OpenTemporaryFileInTablespace (tblspcOid=<optimized out>, rejectError=<optimized out>) at fd.c:1806
#5  0x000056430b60a6d9 in OpenTemporaryFile (interXact=interXact@entry=false) at fd.c:1743
#6  0x000056430b6082c4 in BufFileCreateTemp (interXact=interXact@entry=false) at buffile.c:204
#7  0x000056430b4ccaeb in ExecHashJoinSaveTuple (tuple=tuple@entry=0x56430c243cd8, hashvalue=<optimized out>, hashvalue@entry=1412196535, fileptr=0x7fc500720d98) at nodeHashjoin.c:1248
#8  0x000056430b4c9b3b in ExecHashTableInsert (hashtable=hashtable@entry=0x56430c2434a8, slot=slot@entry=0x56430c208178, hashvalue=hashvalue@entry=1412196535) at nodeHash.c:1714
#9  0x000056430b4ca2fc in MultiExecPrivateHash (node=<optimized out>) at nodeHash.c:188
#10 MultiExecHash (node=node@entry=0x56430c10e098) at nodeHash.c:115
#11 0x000056430b4b213e in MultiExecProcNode (node=node@entry=0x56430c10e098) at execProcnode.c:520
#12 0x000056430b4cb5dc in ExecHashJoinImpl (parallel=false, pstate=0x56430c10ca68) at nodeHashjoin.c:297
#13 ExecHashJoin (pstate=0x56430c10ca68) at nodeHashjoin.c:621
#14 0x000056430b4aa92a in ExecProcNode (node=0x56430c10ca68) at ../../../src/include/executor/executor.h:262
#15 ExecutePlan (queryDesc=queryDesc@entry=0x56430c121928, operation=operation@entry=CMD_SELECT, sendTuples=sendTuples@entry=true, numberTuples=numberTuples@entry=100000, direction=direction@entry=ForwardScanDirection, dest=dest@entry=0x56430bfb70b8) at execMain.c:1640
#16 0x000056430b4ab03b in standard_ExecutorRun (queryDesc=0x56430c121928, direction=ForwardScanDirection, count=100000, execute_once=<optimized out>) at execMain.c:362
#17 0x00007fcd4d2556d5 in explain_ExecutorRun (queryDesc=0x56430c121928, direction=ForwardScanDirection, count=100000, execute_once=<optimized out>) at auto_explain.c:322
#18 0x000056430b641989 in PortalRunSelect (portal=0x56430c062378, forward=<optimized out>, count=100000, dest=<optimized out>) at pquery.c:922
#19 0x000056430b643356 in PortalRunFetch (portal=0x56430c062378, fdirection=FETCH_FORWARD, count=100000, dest=dest@entry=0x56430bfb70b8) at pquery.c:1425
#20 0x000056430b45ad19 in PerformPortalFetch (stmt=stmt@entry=0x56430bf5e710, dest=dest@entry=0x56430bfb70b8, qc=qc@entry=0x7ffcce70ee00) at portalcmds.c:198
#21 0x000056430b644462 in standard_ProcessUtility (pstmt=0x56430bf5ea30, queryString=0x56430bf5dd08 "FETCH FORWARD 100000 FROM \"sqlutilcursor\"", readOnlyTree=<optimized out>, context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x56430bfb70b8, qc=0x7ffcce70ee00) at utility.c:710
#22 0x000056430b641e90 in PortalRunUtility (portal=portal@entry=0x56430c062268, pstmt=0x56430bf5ea30, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=true, dest=dest@entry=0x56430bfb70b8, qc=qc@entry=0x7ffcce70ee00) at pquery.c:1156
#23 0x000056430b642517 in FillPortalStore (portal=portal@entry=0x56430c062268, isTopLevel=isTopLevel@entry=true) at ../../../src/include/nodes/pg_list.h:260
#24 0x000056430b64318d in PortalRun (portal=portal@entry=0x56430c062268, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x56430bf5eb10, altdest=altdest@entry=0x56430bf5eb10, qc=0x7ffcce70eff0) at pquery.c:761
#25 0x000056430b63ee91 in exec_simple_query (query_string=0x56430bf5dd08 "FETCH FORWARD 100000 FROM \"sqlutilcursor\"") at postgres.c:1254
#26 0x000056430b63fbd5 in PostgresMain (dbname=<optimized out>, username=<optimized out>) at postgres.c:4691
#27 0x000056430b5ac569 in BackendRun (port=<optimized out>, port=<optimized out>) at postmaster.c:4515
#28 BackendStartup (port=<optimized out>) at postmaster.c:4243
#29 ServerLoop () at postmaster.c:1811
#30 0x000056430b5ad627 in PostmasterMain (argc=3, argv=0x56430bf574f0) at postmaster.c:1483
#31 0x000056430b2f2d84 in main (argc=3, argv=0x56430bf574f0) at main.c:204
*************
Interestingly, when I tried to run the same query with JIT disabled, i.e.
set jit_above_cost=-1 before the query, the query finished fine without
creating millions of tmp files.
Also, I tried to run the query just in psql (I didn't let it finish, since
it's 60+ million rows), but I think it works fine; I don't see an explosion in
the number of tmp files. So I think the fact that a named cursor was used is
important.
I'm happy to provide more info if I can/ if needed.
Thanks,
          Sergey


PG Bug reporting form <noreply@postgresql.org> writes:
> I have a query: ...
> that I ran (from python client), and after it didn't finish in ~ an hour, i
> started to investigate, finding that the pgsql_tmp folder had 15 million
> temporary files from that query.

Does it behave any better if you disable parallelism
(set max_parallel_workers_per_gather = 0) ?

            regards, tom lane



On Fri, 2025-05-02 at 18:12 -0400, Tom Lane wrote:
> PG Bug reporting form <noreply@postgresql.org> writes:
> > I have a query: ...
> > that I ran (from python client), and after it didn't finish in ~ an hour, i
> > started to investigate, finding that the pgsql_tmp folder had 15 million
> > temporary files from that query.
>
> Does it behave any better if you disable parallelism
> (set max_parallel_workers_per_gather = 0) ?

Yes, I've just tried and that also prevents the issue.

   S

If helpful, here's the bt full during the problem (pg was compiled with -O0 -g3 -ggdb):


#0  0x00007f229bd96c8b in __libc_open64 (file=0x7ffffb4c9b80 "base/pgsql_tmp/pgsql_tmp35816.189548", oflag=578)
    at ../sysdeps/unix/sysv/linux/open64.c:48
        resultvar = 495
        sc_ret = <optimized out>
        mode = 384
#1  0x00005615d86aef93 in BasicOpenFilePerm (fileName=0x7ffffb4c9b80 "base/pgsql_tmp/pgsql_tmp35816.189548",
fileFlags=578,
    fileMode=384) at fd.c:1125
        fd = 22037
        __func__ = "BasicOpenFilePerm"
#2  0x00005615d86af8d6 in PathNameOpenFilePerm (fileName=0x7ffffb4c9b80 "base/pgsql_tmp/pgsql_tmp35816.189548",
    fileFlags=578, fileMode=384) at fd.c:1603
        fnamecopy = 0x56173e322160 "base/pgsql_tmp/pgsql_tmp35816.189548"
        file = 190724
        vfdP = 0x7f1a5539a8f0
        __func__ = "PathNameOpenFilePerm"
#3  0x00005615d86af822 in PathNameOpenFile (fileName=0x7ffffb4c9b80 "base/pgsql_tmp/pgsql_tmp35816.189548",
fileFlags=578)
    at fd.c:1568
No locals.
#4  0x00005615d86afcea in OpenTemporaryFileInTablespace (tblspcOid=1663, rejectError=true) at fd.c:1806
        tempdirpath = "base/pgsql_tmp\000\000\070\231L\373\377\177\000\000\240\037\000\000\377\377", '\000' <repeats 89
times>,"\200\377?", '\000' <repeats 13
 
times>,

"\200\206\373p\330\025V\000\000\000\000\000\000\000\000\000\000\f\000\000\000\000\000\000\000X\240L\373\377\177\000\000@\300b\332\025V\000\000РL\373\377\177\000\
000\305\003q\330\025V\000\000\253t\334<\000\000\000\000G"...
        tempfilepath =

"base/pgsql_tmp/pgsql_tmp35816.189548\000V\000\000\022\000\000\000\000\000\000\000\220\235L\373\377\177\000\000\340l\322g\032\177\000\000\002",
'\000'<repeats
 
63 times>,
"t\234L\373\377\177\000\000x\234L\373\377\177\000\000_\244?$<\327\002\000\205\303\r\004<\327\002\000\000\000\000\000\000\000\000\000\332\340\061
\000\000\000\000\260\234L\373\377\177\000\000\060\227\212\330\025V\000\000\200\234L\373\377\177\000\000"...
        file = 22037
        __func__ = "OpenTemporaryFileInTablespace"
#5  0x00005615d86afb6e in OpenTemporaryFile (interXact=false) at fd.c:1743
        file = 0
#6  0x00005615d86ac995 in BufFileCreateTemp (interXact=false) at buffile.c:204
        file = 0x100000000000003
        pfile = -1879610502
#7  0x00005615d84fb31c in ExecHashJoinSaveTuple (tuple=0x5615da85f5e8, hashvalue=2415356794, fileptr=0x7f1a4f685a80)
    at nodeHashjoin.c:1248
        file = 0x0
#8  0x00005615d84f6a59 in ExecHashTableInsert (hashtable=0x5615da85e5c0, slot=0x5615da823378, hashvalue=2415356794)
    at nodeHash.c:1714
        shouldFree = true
        tuple = 0x5615da85f5e8
        bucketno = 32992122
        batchno = 3521863
#9  0x00005615d84f3fb1 in MultiExecPrivateHash (node=0x5615da6c2f18) at nodeHash.c:188
        bucketNumber = -1
        outerNode = 0x5615da6c2638
        hashkeys = 0x5615da85ba48
        hashtable = 0x5615da85e5c0
        slot = 0x5615da823378
        econtext = 0x5615da7a9c28
        hashvalue = 2415356794
#10 0x00005615d84f3e63 in MultiExecHash (node=0x5615da6c2f18) at nodeHash.c:115
No locals.
#11 0x00005615d84d532b in MultiExecProcNode (node=0x5615da6c2f18) at execProcnode.c:520
        result = 0x1
        __func__ = "MultiExecProcNode"
#12 0x00005615d84f9ca8 in ExecHashJoinImpl (pstate=0x5615da6c18e8, parallel=false) at nodeHashjoin.c:297
        node = 0x5615da6c18e8
        outerNode = 0x5615da6c1bd8
        hashNode = 0x5615da6c2f18
        joinqual = 0x0
        otherqual = 0x0
        econtext = 0x5615da6c1af8
        hashtable = 0x5615da85e5c0
        outerTupleSlot = 0x0
        hashvalue = 0
        batchno = 0
        parallel_state = 0x0
        __func__ = "ExecHashJoinImpl"
#13 0x00005615d84fa350 in ExecHashJoin (pstate=0x5615da6c18e8) at nodeHashjoin.c:621
No locals.
#14 0x00005615d84d5241 in ExecProcNodeFirst (node=0x5615da6c18e8) at execProcnode.c:464
No locals.
#15 0x00005615d84c9dc3 in ExecProcNode (node=0x5615da6c18e8) at ../../../src/include/executor/executor.h:262
No locals.
#16 0x00005615d84cc539 in ExecutePlan (queryDesc=0x5615da6d6cc8, operation=CMD_SELECT, sendTuples=true,
numberTuples=100000,
    direction=ForwardScanDirection, dest=0x5615da56e0d8) at execMain.c:1640
        estate = 0x5615da6c1688
        planstate = 0x5615da6c18e8
        use_parallel_mode = false
        slot = 0x5615da56e0d8
        current_tuple_count = 0
#17 0x00005615d84ca298 in standard_ExecutorRun (queryDesc=0x5615da6d6cc8, direction=ForwardScanDirection,
count=100000,
    execute_once=false) at execMain.c:362
        estate = 0x5615da6c1688
        operation = CMD_SELECT
        dest = 0x5615da56e0d8
        sendTuples = true
        oldcontext = 0x5615da6c3580
#18 0x00007f229c17f6d5 in explain_ExecutorRun (queryDesc=0x5615da6d6cc8, direction=ForwardScanDirection, count=100000,
    execute_once=<optimized out>) at auto_explain.c:322
        _save_exception_stack = 0x7ffffb4ca900
        _save_context_stack = 0x0
        _local_sigjmp_buf = {{
            __jmpbuf = {1, -1720014361324494981, 94651820578848, 140737409496096, 0, 0, -1720014361087516805,
              -1613547901269909637},
            __mask_was_saved = 0,
            __saved_mask = {
              __val = {0, 94651857416032, 0, 0, 0, 0, 0, 140737409492960, 94651827305323, 24, 94651857416032, 0,
                94651857416456, 140737409493008, 94651827395998, 4294967296}
            }
          }}
        _do_rethrow = false
#19 0x00005615d84ca18b in ExecutorRun (queryDesc=0x5615da6d6cc8, direction=ForwardScanDirection, count=100000,
    execute_once=false) at execMain.c:309
No locals.
#20 0x00005615d86ed0a7 in PortalRunSelect (portal=0x5615da616338, forward=true, count=100000, dest=0x5615da56e0d8)
    at pquery.c:922
        queryDesc = 0x5615da6d6cc8
        direction = ForwardScanDirection
        nprocessed = 0
        __func__ = "PortalRunSelect"
#21 0x00005615d86ee021 in DoPortalRunFetch (portal=0x5615da616338, fdirection=FETCH_FORWARD, count=100000,
    dest=0x5615da56e0d8) at pquery.c:1674
        forward = true
        __func__ = "DoPortalRunFetch"
#22 0x00005615d86eda7f in PortalRunFetch (portal=0x5615da616338, fdirection=FETCH_FORWARD, count=100000,
dest=0x5615da56e0d8)
    at pquery.c:1425
        _save_exception_stack = 0x7ffffb4caca0
        _save_context_stack = 0x0
        _local_sigjmp_buf = {{
            __jmpbuf = {1, -1720014361253191813, 94651820578848, 140737409496096, 0, 0, -1720014361318203525,
              -4902996909213425797},
            __mask_was_saved = 0,
            __saved_mask = {
              __val = {2786264749430538240, 94651857065712, 94651858125368, 140737409493456, 64, 94651827132028,
                94651858128616, 7, 0, 7, 94651858125512, 94651858128672, 94651858131912, 140737409493472,
94651827135315,0}
 
            }
          }}
        _do_rethrow = false
        result = 94651858117432
        saveActivePortal = 0x5615da616228
        saveResourceOwner = 0x5615da571c20
        savePortalContext = 0x5615da56df90
        oldContext = 0x5615da56df90
        __func__ = "PortalRunFetch"
#23 0x00005615d84553a9 in PerformPortalFetch (stmt=0x5615da515710, dest=0x5615da56e0d8, qc=0x7ffffb4cac30)
    at portalcmds.c:198
        portal = 0x5615da616338
        nprocessed = 140737409493584
        __func__ = "PerformPortalFetch"
#24 0x00005615d86eec72 in standard_ProcessUtility (pstmt=0x5615da515a30,
    queryString=0x5615da514d08 "FETCH FORWARD 100000 FROM \"sqlutilcursor\"", readOnlyTree=false,
    context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x5615da56e0d8, qc=0x7ffffb4cac30) at
utility.c:710
        parsetree = 0x5615da515710
        isTopLevel = true
        isAtomicContext = true
        pstate = 0x5615da56e168
        readonly_flags = 5
        __func__ = "standard_ProcessUtility"
#25 0x00005615d86ee768 in ProcessUtility (pstmt=0x5615da515a30,
    queryString=0x5615da514d08 "FETCH FORWARD 100000 FROM \"sqlutilcursor\"", readOnlyTree=false,
    context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x5615da56e0d8, qc=0x7ffffb4cac30) at
utility.c:530
No locals.
#26 0x00005615d86ed654 in PortalRunUtility (portal=0x5615da616228, pstmt=0x5615da515a30, isTopLevel=true,
    setHoldSnapshot=true, dest=0x5615da56e0d8, qc=0x7ffffb4cac30) at pquery.c:1156
No locals.
#27 0x00005615d86ed3b7 in FillPortalStore (portal=0x5615da616228, isTopLevel=true) at pquery.c:1029
        treceiver = 0x5615da56e0d8
        qc = {
          commandTag = CMDTAG_UNKNOWN,
          nprocessed = 0
        }
        __func__ = "FillPortalStore"
#28 0x00005615d86ecd74 in PortalRun (portal=0x5615da616228, count=9223372036854775807, isTopLevel=true, run_once=true,
    dest=0x5615da515b10, altdest=0x5615da515b10, qc=0x7ffffb4cae00) at pquery.c:761
        _save_exception_stack = 0x7ffffb4caf20
        _save_context_stack = 0x0
        _local_sigjmp_buf = {{
            __jmpbuf = {0, -1720014361414672517, 94651820578848, 140737409496096, 0, 0, -1720014361467101317,
              -4902996908842098821},
            __mask_was_saved = 0,
            __saved_mask = {
              __val = {20843699160, 94651857066752, 128, 94651857062896, 94651857428368, 94651857065712, 4336,
                94651857428632, 94651857063120, 140737409494400, 94651827307209, 7, 112, 94651857066880, 112,
4216106384}
            }
          }}
        _do_rethrow = false
        result = false
        nprocessed = 94651857066768
        saveTopTransactionResourceOwner = 0x5615da5774a8
        saveTopTransactionContext = 0x5615da56af60
        saveActivePortal = 0x0
        saveResourceOwner = 0x5615da5774a8
        savePortalContext = 0x0
        saveMemoryContext = 0x5615da56af60
        __func__ = "PortalRun"
#29 0x00005615d86e6808 in exec_simple_query (query_string=0x5615da514d08 "FETCH FORWARD 100000 FROM
\"sqlutilcursor\"")
    at postgres.c:1254
        snapshot_set = false
        per_parsetree_context = 0x0
        plantree_list = 0x5615da515ac0
        parsetree = 0x5615da515740
        commandTag = CMDTAG_FETCH
        qc = {
          commandTag = CMDTAG_UNKNOWN,
          nprocessed = 0
        }
        querytree_list = 0x5615da5159e0
        portal = 0x5615da616228
        receiver = 0x5615da515b10
        format = 0
        parsetree_item__state = {
          l = 0x5615da515770,
          i = 0
        }
        dest = DestRemote
        oldcontext = 0x5615da56af60
        parsetree_list = 0x5615da515770
        parsetree_item = 0x5615da515788
        save_log_statement_stats = false
        was_logged = false
        use_implicit_block = false
        msec_str =
"\bMQ\332\025V\000\000)\000\000\000\006\000\000\000p\256L\373\377\177\000\000\333\070~\000)\000\000"
        __func__ = "exec_simple_query"
#30 0x00005615d86eb14c in PostgresMain (dbname=0x5615da5703e8 "wsdb", username=0x5615da510588 "koposov") at
postgres.c:4691
        query_string = 0x5615da514d08 "FETCH FORWARD 100000 FROM \"sqlutilcursor\""
        firstchar = 81
        input_message = {
          data = 0x5615da514d08 "FETCH FORWARD 100000 FROM \"sqlutilcursor\"",
          len = 42,
          maxlen = 1024,
          cursor = 42
        }
        local_sigjmp_buf = {{
            __jmpbuf = {0, -1720014361601319045, 94651820578848, 140737409496096, 0, 0, -1720014361379020933,
              -4902996912041042053},
            __mask_was_saved = 1,
            __saved_mask = {
              __val = {4194304, 18446744073709551536, 0, 0, 139786617312675, 0, 0, 140737409495072, 94651820578848,
                140737409496096, 94651827270967, 129792, 94651857036688, 8248, 42949672960, 94651857748440}
            }
          }}
        send_ready_for_query = false
        idle_in_transaction_timeout_enabled = false
        idle_session_timeout_enabled = false
        __func__ = "PostgresMain"
#31 0x00005615d8638aa1 in BackendRun (port=0x5615da56a760) at postmaster.c:4515
No locals.
#32 0x00005615d86383ed in BackendStartup (port=0x5615da56a760) at postmaster.c:4243
        bn = 0x5615da567980
        pid = 0
        __func__ = "BackendStartup"
#33 0x00005615d8634a2f in ServerLoop () at postmaster.c:1811
        port = 0x5615da56a760
        i = 0
        rmask = {
          fds_bits = {64, 0 <repeats 15 times>}
        }
        selres = 1
        now = 1746287454
        readmask = {
          fds_bits = {4032, 0 <repeats 15 times>}
        }
        nSockets = 12
        last_lockfile_recheck_time = 1746287454
        last_touch_time = 1746287454
        __func__ = "ServerLoop"
#34 0x00005615d8634288 in PostmasterMain (argc=3, argv=0x5615da50e4f0) at postmaster.c:1483
        opt = -1
        status = 0
        userDoption = 0x5615da531d40 "/mnt/bigdata/pgdata15"
        listen_addr_saved = true
        i = 64
        output_config_variable = 0x0
        __func__ = "PostmasterMain"
#35 0x00005615d8544c28 in main (argc=3, argv=0x5615da50e4f0) at main.c:204
        do_check_root = true
quit
Please answer y or n.
Detaching from program: /opt/pgsql15/bin/postgres, process 35816
[Inferior 1 (process 35816) detached]

   S

Sergey Koposov <Sergey.Koposov@ed.ac.uk> writes:
> #8  0x00005615d84f6a59 in ExecHashTableInsert (hashtable=0x5615da85e5c0, slot=0x5615da823378, hashvalue=2415356794)
>     at nodeHash.c:1714
>         shouldFree = true
>         tuple = 0x5615da85f5e8
>         bucketno = 32992122
>         batchno = 3521863

Yeah, this confirms the idea that the hashtable has exploded into an
unreasonable number of buckets and batches.  I don't know why a
parallel hash join would be more prone to do that than a non-parallel
one, though.  I'm hoping some of the folks who worked on PHJ will
look at this.

What have you got work_mem set to?  I hope it's fairly large, if
you need to join such large tables.

            regards, tom lane



On Sat, 2025-05-03 at 12:27 -0400, Tom Lane wrote:
> Sergey Koposov <Sergey.Koposov@ed.ac.uk> writes:
> > #8  0x00005615d84f6a59 in ExecHashTableInsert (hashtable=0x5615da85e5c0, slot=0x5615da823378,
hashvalue=2415356794)
> >     at nodeHash.c:1714
> >         shouldFree = true
> >         tuple = 0x5615da85f5e8
> >         bucketno = 32992122
> >         batchno = 3521863
>
> Yeah, this confirms the idea that the hashtable has exploded into an
> unreasonable number of buckets and batches.  I don't know why a
> parallel hash join would be more prone to do that than a non-parallel
> one, though.  I'm hoping some of the folks who worked on PHJ will
> look at this.
>

Thanks

> What have you got work_mem set to?  I hope it's fairly large, if
> you need to join such large tables.
>

Here are my memory settings:

shared_buffers = 32GB
work_mem = 1GB


    S



Re: BUG #18909: Query creates millions of temporary files and stalls

From: Andrei Lepikhov
On 3/5/2025 18:52, Sergey Koposov wrote:
> On Sat, 2025-05-03 at 12:27 -0400, Tom Lane wrote:
>> Yeah, this confirms the idea that the hashtable has exploded into an
>> unreasonable number of buckets and batches.  I don't know why a
>> parallel hash join would be more prone to do that than a non-parallel
>> one, though.  I'm hoping some of the folks who worked on PHJ will
>> look at this.
> Here're  my memory settings
> 
> shared_buffers = 32GB
> work_mem = 1GB
Could you show statistics from the pg_statistic table for two columns,
"d.objid" and "gaps1.original_ext_source_id"? At least stanullfrac,
stadistinct, stawidth and the MCV frequencies.
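For example, something like this against the pg_stats view should show the relevant numbers (just a sketch; add schemaname to the filter if the table names are ambiguous):

SELECT schemaname, tablename, attname,
       null_frac, n_distinct, avg_width,
       most_common_vals, most_common_freqs
FROM pg_stats
WHERE (tablename, attname) IN (('disk_sample1', 'objid'),
                               ('panstarrs1bestneighbour', 'original_ext_source_id'));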

Also, an EXPLAIN ANALYZE on this query, which successfully finished 
execution without parallel workers, might provide quick insights.

-- 
regards, Andrei Lepikhov



On Sat, 2025-05-03 at 19:56 +0200, Andrei Lepikhov wrote:
>
> On 3/5/2025 18:52, Sergey Koposov wrote:
> > On Sat, 2025-05-03 at 12:27 -0400, Tom Lane wrote:
> > > Yeah, this confirms the idea that the hashtable has exploded into an
> > > unreasonable number of buckets and batches.  I don't know why a
> > > parallel hash join would be more prone to do that than a non-parallel
> > > one, though.  I'm hoping some of the folks who worked on PHJ will
> > > look at this.
> > Here're  my memory settings
> >
> > shared_buffers = 32GB
> > work_mem = 1GB
> May you show statistics from the pg_statistic table on two columns:
> "d.objid" and "gaps1.original_ext_source_id"? At least stanullfrac,
> stadistinct, stawidth and MCV frequencies.
>
> Also, an EXPLAIN ANALYZE on this query, which successfully finished
> execution without parallel workers, might provide quick insights.
>


For some reason, when running in psql with

wsdb=> set max_parallel_workers_per_gather = 0;
SET
wsdb=> explain analyze select phot_g_mean_mag,phot_bp_mean_mag,phot_rp_mean_mag,
g.source_id,pmra,pmdec,parallax,parallax_error,d.ra,d.dec,d.ebv,d.gmeanpsfmag,rmeanpsfmag,imeanpsfmag,gkronmag,rkronmag
from disk_sample1 as d left join
gaia_edr3_aux.panstarrs1bestneighbour as gaps1 on (gaps1.original_ext_source_id=d.objid) left join
gaia_edr3.gaia_source as g on ( g.source_id = gaps1.source_id ) ;

it also triggered the issue, so it didn't finish.
(I don't quite know why, when I did the same thing earlier from Python, it worked fine, unless I made a mistake somehow,
or the issue depends on which parts of the table are in the buffers.)

And here are the statistics:

              table_name               |      column_name       | stanullfrac | stadistinct | stawidth | stakind1 |
stavalues1
| stakind2 | stavalues2 | stakind3 | stavalues3 | stakind4 | stavalues4 | stakind5 | stavalues5

---------------------------------------+------------------------+-------------+-------------+----------+----------+----------------------------------------------

-----------------------------------------------------------------------------------------------------------------------------------------------------------------

-----------------------------------------------------------------------------------------------------------------------------------------------------------------

-----------------------------------------------------------------------------------------------------------------------------------------------------------------

-----------------------------------------------------------------------------------------------------------------------------------------------------------------

-----------------------------------------------------------------------------------------------------------------------------------------------------------------

-----------------------------------------------------------------------------------------------------------------------------------------------------------------

-----------------------------------------------------------------------------------------------------------------------------------------------------------------

-----------------------------------------------------------------------------------------------------------------------------------------------------------------

-----------------------------------------------------------------------------------------------------------------------------------------------------------------

-----------------------------------------------------------------------------------------------------------------------------------------------------------------

-----------------------------------------------------------------------------------------------------------------------------------------------------------------

-----------------------------------------------------------------------------------------------+----------+------------+----------+------------+----------+------
------+----------+------------
 gaia_edr3_aux.panstarrs1bestneighbour | original_ext_source_id |           0 |          -1 |        8 |        2 |

{65821194190923049,69822775194516201,71021357366914725,71622558219298246,72242570679316167,72752715133276582,73292555331458103,73722640104185638,7431271120960725

3,74872844134480919,75461207930286818,75932867164620064,76482380938670301,77062576694809476,77652902237497049,78262718311176992,78892794541132028,795515044225764

94,80112867400946845,80772802603258902,81391421588335444,81971412486350795,82631186975787792,83252787960950269,83972820164120115,84752904153083118,85512619880528

302,86202647938176936,86842598063633314,87602612520706466,88421075961167122,89172651209793104,89952620931199224,90722572791560233,91650973331643328,9261269053096

6902,93602857051441783,94782658946712918,95712666148671194,96952828393314703,97932742026524718,98862853134788904,99802966167329820,100902836681868414,10190281236

5694460,103031292324379499,104431512205175644,105751238115030863,106953048414271619,108123145169985425,109550825872818887,111053588196313311,112282896513802828,1

13570942665251677,115022939740948640,116382768852523070,117662904386441014,118932892636732902,120190477050925736,121420948001414218,122442964708121838,1238132188

80751231,124992869043914348,126310843575958592,127562955440231806,128872831401617836,130233203538192704,131521007626061188,132613587058728897,134002801572489756,

135283031853672449,136872614522998101,138313019539925635,140112936758215218,141542930246066222,142983169221289589,144482954390781393,146082964742924866,147612878

246882308,149172975717758062,150602803315151189,152220866064652863,154243076414126838,156152890996068633,158242935213542742,160152946363967092,162193040985483908

,164030230662105236,166071022961061724,167933557037565204,169750502017714204,171783528800592714,173633125479747431,175841940675015146,178023541392200467,18008076
9816969530,182390090370071085,185103234223553859,189060291309395827,194410132668322056,213381935620486100}
                   |        3 |
 
|        0 |            |        0 |            |        0 |
 koposov.disk_sample1                  | objid                  |           0 |          -1 |        8 |        2 |

{80171356131152995,85481367757687896,88191325798310474,90321328433852673,92321466354661847,94201275367376529,95701503912853131,97161484423571822,9853144852720894

4,99921427833159308,101151276638551505,102331296875522814,103501223994628897,104551265261567639,105761212996488657,106981229573480759,108151464449806724,10952135

9876084906,110721535047532362,112031293425751247,112991512837345375,114041353644327525,115181393912959844,116271438699167868,117361157875578842,11846119946421047

5,119651383957806910,120691403987052979,121971437975375438,122991360553450551,124081345165940966,125561127251039774,126951413062522443,128261239226290575,1294911

91145884395,130701198797726914,131881390246784832,133071309574028386,134441245837162400,135561325470268624,136831168062713109,138031312799349329,1392611655099586

83,140541093855172591,141721073205266546,142881401626551876,144371304279438040,145581082546437325,147031098350718441,148451143937693348,149711230639891464,151101

129058019814,152461282899556445,153601374120407807,154871110028410102,156201081651853270,157471169532236664,158711380905398833,160070980700916736,161581282957436

146,163001203841617369,164491232859199607,165710958822329339,167161342475229839,168720942762011833,170101108127616821,171491208376278158,173061183798016157,17436

1022423370078,175860958433513392,177241069877220468,178661417914144713,180101182484225880,181550985775752317,183000983447406349,184530819926849727,18598116819371

2346,187450900510134237,188911315124166293,190320741585574970,191851126057417156,193291748947587292,194551520693837798,195430932247712984,196451520400227653,1976

70906570737306,198601716301982705,199510491344039825,200363325766562309,201183158865495520,201921012051408117,202533343632192769,203143424629645681,2039131989607

26463,204762033978144872,205573304352815754,206650175134906301,207582854953296137,208920557060736946,210870029621871940,215863110451068650}
|       3 |
 
|        0 |            |        0 |            |        0 |
(2 rows)


Basically most of the ids should be distinct, but I would expect most objects from disk_sample1 to have a 1-to-1 match in
the gaps1 table.


   Sergey

And this is the EXPLAIN ANALYZE run that finishes in psql, with the default settings (i.e. with parallelism):
                                                                                      QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=14443102.62..111487693.90 rows=65470868 width=100) (actual time=212912.377..334892.676 rows=65514296 loops=1)
   Workers Planned: 4
   Workers Launched: 4
   ->  Nested Loop Left Join  (cost=14442102.62..104939607.10 rows=16367717 width=100) (actual time=209806.019..317684.146 rows=13102859 loops=5)
         ->  Parallel Hash Left Join  (cost=14442102.04..22124798.60 rows=16367717 width=60) (actual time=209805.943..273008.489 rows=13102859 loops=5)
               Hash Cond: (d.objid = gaps1.original_ext_source_id)
               ->  Parallel Seq Scan on disk_sample1 d  (cost=0.00..1218371.17 rows=16367717 width=60) (actual time=37.353..25185.340 rows=13095751 loops=5)
               ->  Parallel Hash  (cost=10307380.24..10307380.24 rows=237862624 width=16) (actual time=169633.067..169633.068 rows=190290095 loops=5)
                     Buckets: 67108864  Batches: 32  Memory Usage: 1919904kB
                     ->  Parallel Seq Scan on panstarrs1bestneighbour gaps1  (cost=0.00..10307380.24 rows=237862624 width=16) (actual time=132.295..117548.803 rows=190290095 loops=5)
         ->  Index Scan using gaia_sourcex_source_id_idx on gaia_source g  (cost=0.58..5.05 rows=1 width=48) (actual time=0.003..0.003 rows=0 loops=65514296)
               Index Cond: (source_id = gaps1.source_id)
 Planning Time: 1.266 ms
 JIT:
   Functions: 75
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 11.796 ms, Inlining 293.374 ms, Optimization 81.354 ms, Emission 173.338 ms, Total 559.861 ms
 Execution Time: 337814.695 ms
(18 rows)

And I did verify again that when I query through Python (i.e. using the cursor) with
max_parallel_workers_per_gather=0 it finishes fine.

Also, I can clearly see that when I query through the cursor in Python with the default settings (when I see the issue),
it clearly uses a different plan, as opposed to just running the query in psql.
Because when running in psql I see these kinds of files in the tmp directory:
pgsql_tmp75270.0.fileset
as opposed to:
pgsql_tmp73459.0 ...

I don't think I know how to see the plan of the DECLARE CURSOR query.

So, summarizing:

The query produces millions of files in:

1) query through cursor with default settings (max_parallel_workers_per_gather=4)
2) query through psql with no parallelism (max_parallel_workers_per_gather=0)

It works in:

3) query through cursor with no parallelism (max_parallel_workers_per_gather=0)
4) query through psql with default settings (max_parallel_workers_per_gather=4)

I hope it makes sense to someone & helps.

        Sergey





Sergey Koposov <Sergey.Koposov@ed.ac.uk> writes:
>          ->  Parallel Hash Left Join  (cost=14442102.04..22124798.60 rows=16367717 width=60) (actual time=209805.943..273008.489 rows=13102859 loops=5)
>                Hash Cond: (d.objid = gaps1.original_ext_source_id)
>                ->  Parallel Seq Scan on disk_sample1 d  (cost=0.00..1218371.17 rows=16367717 width=60) (actual time=37.353..25185.340 rows=13095751 loops=5)
>                ->  Parallel Hash  (cost=10307380.24..10307380.24 rows=237862624 width=16) (actual time=169633.067..169633.068 rows=190290095 loops=5)
>                      Buckets: 67108864  Batches: 32  Memory Usage: 1919904kB
>                      ->  Parallel Seq Scan on panstarrs1bestneighbour gaps1  (cost=0.00..10307380.24 rows=237862624 width=16) (actual time=132.295..117548.803 rows=190290095 loops=5)

Hm, interesting.  The number of batches stayed sane here (32), whereas
it went crazy in the other run.  I wonder if there's something
nondeterministic about that choice in a parallel hash join.

> I don't think I know how to see the plan of the declare cursor query.

EXPLAIN DECLARE c CURSOR FOR SELECT ...

            regards, tom lane



On Sat, 2025-05-03 at 18:03 -0400, Tom Lane wrote:
> Sergey Koposov <Sergey.Koposov@ed.ac.uk> writes:
> >          ->  Parallel Hash Left Join  (cost=14442102.04..22124798.60 rows=16367717 width=60) (actual time=209805.943..273008.489 rows=13102859 loops=5)
> >                Hash Cond: (d.objid = gaps1.original_ext_source_id)
> >                ->  Parallel Seq Scan on disk_sample1 d  (cost=0.00..1218371.17 rows=16367717 width=60) (actual time=37.353..25185.340 rows=13095751 loops=5)
> >                ->  Parallel Hash  (cost=10307380.24..10307380.24 rows=237862624 width=16) (actual time=169633.067..169633.068 rows=190290095 loops=5)
> >                      Buckets: 67108864  Batches: 32  Memory Usage: 1919904kB
> >                      ->  Parallel Seq Scan on panstarrs1bestneighbour gaps1  (cost=0.00..10307380.24 rows=237862624 width=16) (actual time=132.295..117548.803 rows=190290095 loops=5)
>
> Hm, interesting.  The number of batches stayed sane here (32), whereas
> it went crazy in the other run.  I wonder if there's something
> nondeterministic about that choice in a parallel hash join.
>
> > I don't think I know how to see the plan of the declare cursor query.
>
> EXPLAIN DECLARE c CURSOR FOR SELECT ...


Thanks! I tried something like that before, but without the cursor name.

Here's the plan for the case that originally triggered it.
Note that I did not disable the parallelism here by hand.

wsdb=> set cursor_tuple_fraction TO 1;
SET
wsdb=> EXPLAIN DECLARE c CURSOR FOR select phot_g_mean_mag,phot_bp_mean_mag,phot_rp_mean_mag,
g.source_id,pmra,pmdec,parallax,parallax_error,d.ra,d.dec,d.ebv,d.gmeanpsfmag,rmeanpsfmag,imeanpsfmag,gkronmag,rkronmag
from disk_sample1 as d left join
gaia_edr3_aux.panstarrs1bestneighbour as gaps1 on (gaps1.original_ext_source_id=d.objid) left join
gaia_edr3.gaia_source as g on ( g.source_id = gaps1.source_id ) ;
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Hash Right Join  (cost=112529534.06..422118114.74 rows=65470868 width=100)
   Hash Cond: (g.source_id = gaps1.source_id)
   ->  Seq Scan on gaia_source g  (cost=0.00..124645680.12 rows=1811786112 width=48)
   ->  Hash  (cost=111007847.21..111007847.21 rows=65470868 width=60)
         ->  Hash Right Join  (cost=3231089.53..111007847.21 rows=65470868 width=60)
               Hash Cond: (gaps1.original_ext_source_id = d.objid)
               ->  Seq Scan on panstarrs1bestneighbour gaps1  (cost=0.00..17443258.96 rows=951450496 width=16)
               ->  Hash  (cost=1709402.68..1709402.68 rows=65470868 width=60)
                     ->  Seq Scan on disk_sample1 d  (cost=0.00..1709402.68 rows=65470868 width=60)
 JIT:
   Functions: 18
   Options: Inlining true, Optimization true, Expressions true, Deforming true
(12 rows)


       S

Sergey Koposov <Sergey.Koposov@ed.ac.uk> writes:
> wsdb=> set cursor_tuple_fraction TO 1;
> SET
> wsdb=> EXPLAIN DECLARE c CURSOR FOR select phot_g_mean_mag,phot_bp_mean_mag,phot_rp_mean_mag,
> g.source_id,pmra,pmdec,parallax,parallax_error,d.ra,d.dec,d.ebv,d.gmeanpsfmag,rmeanpsfmag,imeanpsfmag,gkronmag,rkronmag
> from disk_sample1 as d left join
> gaia_edr3_aux.panstarrs1bestneighbour as gaps1 on (gaps1.original_ext_source_id=d.objid) left join
> gaia_edr3.gaia_source as g on ( g.source_id = gaps1.source_id ) ;
>                                                   QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------
>  Hash Right Join  (cost=112529534.06..422118114.74 rows=65470868 width=100)
>    Hash Cond: (g.source_id = gaps1.source_id)
>    ->  Seq Scan on gaia_source g  (cost=0.00..124645680.12 rows=1811786112 width=48)
>    ->  Hash  (cost=111007847.21..111007847.21 rows=65470868 width=60)
>          ->  Hash Right Join  (cost=3231089.53..111007847.21 rows=65470868 width=60)
>                Hash Cond: (gaps1.original_ext_source_id = d.objid)
>                ->  Seq Scan on panstarrs1bestneighbour gaps1  (cost=0.00..17443258.96 rows=951450496 width=16)
>                ->  Hash  (cost=1709402.68..1709402.68 rows=65470868 width=60)
>                      ->  Seq Scan on disk_sample1 d  (cost=0.00..1709402.68 rows=65470868 width=60)

Oh, *that* is interesting: now we have two hash joins, and neither one
has panstarrs1bestneighbour as the table to hash, so it's not too
clear which one is going crazy.  But you showed the stats for
disk_sample1.objid, and that looked pretty well distributed, so I'm
going to guess that that hash is fine.  That leaves the other join
on panstarrs1bestneighbour.source_id as the one under suspicion.
Can we see the stats for that column?

            regards, tom lane



On Sat, 2025-05-03 at 18:47 -0400, Tom Lane wrote:
> Sergey Koposov <Sergey.Koposov@ed.ac.uk> writes:
> > wsdb=> set cursor_tuple_fraction TO 1;
> > SET
> > wsdb=> EXPLAIN DECLARE c CURSOR FOR select phot_g_mean_mag,phot_bp_mean_mag,phot_rp_mean_mag,
> > g.source_id,pmra,pmdec,parallax,parallax_error,d.ra,d.dec,d.ebv,d.gmeanpsfmag,rmeanpsfmag,imeanpsfmag,gkronmag,rkronmag
> > from disk_sample1 as d left join
> > gaia_edr3_aux.panstarrs1bestneighbour as gaps1 on (gaps1.original_ext_source_id=d.objid) left join
> > gaia_edr3.gaia_source as g on ( g.source_id = gaps1.source_id ) ;
> >                                                   QUERY PLAN
> > ---------------------------------------------------------------------------------------------------------------
> >  Hash Right Join  (cost=112529534.06..422118114.74 rows=65470868 width=100)
> >    Hash Cond: (g.source_id = gaps1.source_id)
> >    ->  Seq Scan on gaia_source g  (cost=0.00..124645680.12 rows=1811786112 width=48)
> >    ->  Hash  (cost=111007847.21..111007847.21 rows=65470868 width=60)
> >          ->  Hash Right Join  (cost=3231089.53..111007847.21 rows=65470868 width=60)
> >                Hash Cond: (gaps1.original_ext_source_id = d.objid)
> >                ->  Seq Scan on panstarrs1bestneighbour gaps1  (cost=0.00..17443258.96 rows=951450496 width=16)
> >                ->  Hash  (cost=1709402.68..1709402.68 rows=65470868 width=60)
> >                      ->  Seq Scan on disk_sample1 d  (cost=0.00..1709402.68 rows=65470868 width=60)
>
> Oh, *that* is interesting: now we have two hash joins, and neither one
> has panstarrs1bestneighbour as the table to hash, so it's not too
> clear which one is going crazy.  But you showed the stats for
> disk_sample1.objid, and that looked pretty well distributed, so I'm
> going to guess that that hash is fine.  That leaves the other join
> on panstarrs1bestneighbour.source_id as the one under suspicion.
> Can we see the stats for that column?
>

I don't quite know what you are looking for here, but this is panstarrs1bestneighbour.source_id.
Those are object ids for ~ a billion objects, so basically I expect most of them to be different,
which matches my expectation.


              table_name               | column_name | stanullfrac | stadistinct | stawidth | stakind1 |
stavalues1
| stakind2 | stavalues2 | stakind3 | stavalues3 | stakind4 | stavalues4 | stakind5 | stavalues5

---------------------------------------+-------------+-------------+-------------+----------+----------+---------------------------------------------------------

-----------------------------------------------------------------------------------------------------------------------------------------------------------------

-----------------------------------------------------------------------------------------------------------------------------------------------------------------

-----------------------------------------------------------------------------------------------------------------------------------------------------------------

-----------------------------------------------------------------------------------------------------------------------------------------------------------------

-----------------------------------------------------------------------------------------------------------------------------------------------------------------

-----------------------------------------------------------------------------------------------------------------------------------------------------------------

-----------------------------------------------------------------------------------------------------------------------------------------------------------------

-----------------------------------------------------------------------------------------------------------------------------------------------------------------

-----------------------------------------------------------------------------------------------------------------------------------------------------------------

-----------------------------------------------------------------------------------------------------------------------------------------------------------------

-----------------------------------------------------------------------------------------------------------------------------------------------------------------

-----------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------+----------+------------+----------+------------+----------+------------+----------+------------
 gaia_edr3_aux.panstarrs1bestneighbour | source_id   |           0 |          -1 |        8 |        2 |

{806904097912320,181280403379521536,231030517101341952,279340244824304640,395434550873163776,430064940901112960,465200384719646464,511243945963340288,55485098513

0936448,884099024809719424,1088756926588540928,1368729867335745024,1743677218036649344,1806876076961984000,1822446535998227072,1827513777076336640,18434769335570

31936,1864935449158255104,1920817367429277568,1968055999039378816,1987254777719980160,2004190452249389056,2020732153709040512,2027928972718120832,203348532469370

5856,2045418530529014272,2059732346543415936,2074956120121436416,2100442043737863296,2163620222393133568,2179506511844570496,2206141459472625664,2252050369194606

336,2591598075455078272,2906007067548492032,2949458190829542272,3032992517301187328,3068094632176092416,3120910944525529984,3215924142328292992,33363926191583658

24,3378822326196932864,3442282147328280064,3663373148030165504,4046707917046071296,4050407357085387776,4052784814808650880,4058733516320497280,406154888016680512

0,4064082807792646528,4067157871310616320,4070490250503721728,4075824943460652416,4079541052298279552,4089451977102617600,4091947563572467328,4096492394857354624

,4101247370255732864,4104701520432256512,4108024244563444608,4111583191923415552,4116149772934806784,4118390921289286016,4121139459806669056,4124851857730469120,

4135313745180553728,4145067478499619328,4150596918151346688,4157604522387932416,4169873659086912896,4186656398413699456,4202234455318397824,4208158108512840448,4

237268606566890112,4252491997962068864,4256170723281997056,4262931890855102848,4268558164899557376,4283472009313681920,4291562353308821376,4298242676736372736,43

08513249218792320,4314497062051819904,4320919893590099456,4369138185748815360,4459756193193297920,4481744016273417216,4507165210444471936,4516588295676538240,453

3771875708241024,4602125356804206464,5598986647730081280,5639341335818720640,5699272347395052928,5980496601182037504,6031590184872417536,6046668097063710464,6251
266251555253888,6755724660278190464,6794820098303412096,6917184811783233280} |        3 |            |        0 |
    |        0 |            |        0 |
 
(1 row)

The only thing I can add is that panstarrs1bestneighbour is ordered on disk by source_id. I don't know if that can be
relevant.


           S

Re: BUG #18909: Query creates millions of temporary files and stalls

From: Andrei Lepikhov
On 4/5/2025 01:05, Sergey Koposov wrote:
> The only thing I can add is that panstarrs1bestneighbour is ordered on disk by source_id. I don't know if that can be relevant.
Seems promising. Could you show us how many NULLs the underlying
JOIN operator generates? I guess that in your example the query should be close to the
following:

SELECT count(*) FROM disk_sample1 d
   LEFT JOIN gaia_edr3_aux.panstarrs1bestneighbour AS gaps1
   ON (gaps1.original_ext_source_id = d.objid)
WHERE gaps1.original_ext_source_id IS NULL;

And just to be sure, show us the number of NULLs that may be generated
by the alternative LEFT JOIN:

SELECT count(*) FROM gaia_edr3_aux.panstarrs1bestneighbour AS gaps1
   LEFT JOIN gaia_edr3.gaia_source AS g
   ON (g.source_id = gaps1.source_id)
WHERE g.source_id IS NULL;

Also, please show the current value of the GUC hash_mem_multiplier.

-- 
regards, Andrei Lepikhov



On Sun, 2025-05-04 at 09:52 +0200, Andrei Lepikhov wrote:
> On 4/5/2025 01:05, Sergey Koposov wrote:
> > The only thing I can add is that panstarrs1bestneighbour is ordered on disk by source_id. I don't know if that can be relevant.
> Seems promising. May you show us how much NULLs generates underlying
> JOIN operator. I guess, in your example the query should be close to the
> following:
>
> SELECT count(*) FROM disk_sample1 d
>    LEFT JOIN gaia_edr3_aux.panstarrs1bestneighbour AS gaps1
>    ON (gaps1.original_ext_source_id = d.objid)
> WHERE gaps1.original_ext_source_id IS NULL;
>
> And just to be sure, show us the number of NULLs that may be generated
> by alternative LEFT JOIN:
>
> SELECT count(*) FROM gaia_edr3_aux.panstarrs1bestneighbour AS gaps1
>    LEFT JOIN gaia_edr3.gaia_source AS g
>    ON (g.source_id = gaps1.source_id)
> WHERE g.source_id IS NULL;
>
> Also, show please the current value of the GUC hash_mem_multiplier.
>

Here:

wsdb=>
SELECT count(*) FROM disk_sample1 d
   LEFT JOIN gaia_edr3_aux.panstarrs1bestneighbour AS gaps1
   ON (gaps1.original_ext_source_id = d.objid)
WHERE gaps1.original_ext_source_id IS NULL;
  count
----------
 37635427
(1 row)

wsdb=>
SELECT count(*) FROM gaia_edr3_aux.panstarrs1bestneighbour AS gaps1
   LEFT JOIN gaia_edr3.gaia_source AS g
   ON (g.source_id = gaps1.source_id)
WHERE g.source_id IS NULL;
 count
-------
     0
(1 row)

wsdb=> show hash_mem_multiplier ;
 hash_mem_multiplier
---------------------
 2
(1 row)

  Sergey

Re: BUG #18909: Query creates millions of temporary files and stalls

From: Andrei Lepikhov
On 4/5/2025 13:45, Sergey Koposov wrote:
> On Sun, 2025-05-04 at 09:52 +0200, Andrei Lepikhov wrote:
>> On 4/5/2025 01:05, Sergey Koposov wrote:
>>> The only thing I can add is that panstarrs1bestneighbour is ordered on disk by source_id. I don't know if that can be relevant.
>> Seems promising. May you show us how much NULLs generates underlying
>> JOIN operator. I guess, in your example the query should be close to the
>> following:
>>
>> SELECT count(*) FROM disk_sample1 d
>>     LEFT JOIN gaia_edr3_aux.panstarrs1bestneighbour AS gaps1
>>     ON (gaps1.original_ext_source_id = d.objid)
>> WHERE gaps1.original_ext_source_id IS NULL;
>>
>> And just to be sure, show us the number of NULLs that may be generated
>> by alternative LEFT JOIN:
>>
>> SELECT count(*) FROM gaia_edr3_aux.panstarrs1bestneighbour AS gaps1
>>     LEFT JOIN gaia_edr3.gaia_source AS g
>>     ON (g.source_id = gaps1.source_id)
>> WHERE g.source_id IS NULL;
>>
>> Also, show please the current value of the GUC hash_mem_multiplier.
>>
> 
> Here:
> 
> wsdb=>
> SELECT count(*) FROM disk_sample1 d
>     LEFT JOIN gaia_edr3_aux.panstarrs1bestneighbour AS gaps1
>     ON (gaps1.original_ext_source_id = d.objid)
> WHERE gaps1.original_ext_source_id IS NULL;
>    count
> ----------
>   37635427
> (1 row)
> 
> wsdb=>
> SELECT count(*) FROM gaia_edr3_aux.panstarrs1bestneighbour AS gaps1
>     LEFT JOIN gaia_edr3.gaia_source AS g
>     ON (g.source_id = gaps1.source_id)
> WHERE g.source_id IS NULL;
>   count
> -------
>       0
> (1 row)
> 
> wsdb=> show hash_mem_multiplier ;
>   hash_mem_multiplier
> ---------------------
>   2
> (1 row)
> 
>    Sergey
Hmm. That means the bucket size to store tuples with NULL values should 
be at least 2.103GB (I think it will borrow more).
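That figure is presumably just the ~37.6 million NULL-extended rows times the ~60-byte join tuple width from the plan; since all those rows hash identically on the NULL join key, they must land in a single batch that cannot be split further. A quick sanity check:

SELECT pg_size_pretty(37635427::bigint * 60);   -- roughly 2154 MB, i.e. about 2.1 GiB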

So, my main conjecture is an extreme data skew issue - we have told
Postgres to use less memory than a single non-divisible batch needs.
To solve it, we have at least two ongoing projects in this area.
Project [1] is dedicated to resolving this issue using 'slices', by
matching the too-massive batch against the corresponding batch on the other
side of the join in small portions. Project [2] should allow the
optimiser to take into account NULLs generated by the nullable side of
the join, which gives more chances of avoiding a skewed HashJoin.

Neither of these projects is guaranteed to be committed even in PG 19.
What's more, there is little chance they will be back-patched into v15.

So, if your problem is exactly too many NULLs, I would say you need
a workaround: increase hash_mem_multiplier, if you are sure you can
estimate an upper bound on the number of such NULLs in this query,
or turn off hash joins for this query entirely (see the sketch below).
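For this particular case that could look roughly like this (a sketch; the multiplier needed depends on how large that single NULL batch really is - with the reported work_mem of 1GB, a multiplier of 4 would give each hash join up to 4GB, enough for the ~2.1GB batch):

SET hash_mem_multiplier = 4;   -- per-hash-join memory limit becomes work_mem * 4
-- or, to avoid hash joins for this query altogether:
SET enable_hashjoin = off;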

As for fixing the bug quickly (I have at least one report on this issue
right now, too), I think we could introduce two or three steps of (local)
hash_mem_multiplier incrementation when a massive batch that refuses to
break down is detected.
And it definitely makes sense to me to add an error message that tells
people what to do if they run into this problem.

[1] 
https://www.postgresql.org/message-id/flat/7bed6c08-72a0-4ab9-a79c-e01fcdd0940f%40vondra.me
[2] 
https://www.postgresql.org/message-id/flat/52b55b53-420a-722c-adbd-706922fc059b%40enterprisedb.com

-- 
regards, Andrei Lepikhov



Thanks,

I don't have a strong opinion on how this is fixed, but I personally think this
is a bug that needs addressing in the back branches.

I think I remember hitting OOMs in similar situations with hash joins in the past,
and I just disabled them.

But I'd certainly argue the situation here is more severe, with 12 million temp files (possibly more).
Even cancelling the query doesn't work well,
because deleting the temp files during cancellation is not fast enough.
Only restarting the server deletes the tmp files fast enough (it still takes a few minutes).
And apparently even ls can't quite deal with 12 million files in one folder.

  Sergey

