Thread: Shared detoast Datum proposal
Problem
-------
TOAST works well for its intended purposes. However, the current detoast
infrastructure doesn't share the detoasted datum within a query's lifespan,
which causes inefficiency in cases like this:

  SELECT big_jsonb_col->'a', big_jsonb_col->'b', big_jsonb_col->'c' FROM t;

Here big_jsonb_col is detoasted 3 times for every record. A more common case
may be:

  CREATE TABLE t(a numeric);
  INSERT INTO t SELECT i FROM generate_series(1, 10) i;
  SELECT * FROM t WHERE a > 3;

Here detoasting is needed because of VARATT_CAN_MAKE_SHORT, and it has to be
done twice: once in "a > 3", and once in the targetlist, where numeric_out is
called.

Proposal
--------
When we access a desired toasted column in the EEOP_{SCAN|INNER|OUTER}_VAR
steps, detoast it immediately and save the result back to slot->tts_values.
That way, when any other expression looks up the Datum, it gets the
detoasted version. The planner decides which Vars should use this feature;
the executor manages the detoast action and the memory.

Planner design
--------------
1. This feature only applies at the plan node where the detoast would have
happened with the existing code. For example:

  SELECT t1.toast_col, t2.toast_col FROM t1 JOIN t2 USING (toast_col);

toast_col is only detoasted at the Join node's slot, even if we have a
projection on t1 or t2 at the scan node (except for parameterized paths).
However, in

  SELECT t1.toast_col, t2.toast_col FROM t1 JOIN t2 USING (toast_col)
  WHERE t1.toast_col > 'a';

the detoast *may* happen at the scan of t1, since "t1.toast_col > 'a'"
accesses the Var within a function call (the '>' operator), which causes a
detoast. (However, it should not happen if the scan is under a Sort node;
for details see section 2 below.)

On the implementation side, I added "Bitmapset *reference_attrs;" to the
Scan node, which records whether a Var should be accessed with the
pre-detoast logic in the expression execution engine. The value is
maintained at the create_plan/set_plan_refs stage. Two similar fields are
added to the Join node:

  Bitmapset *outer_reference_attrs;
  Bitmapset *inner_reference_attrs;

In both cases I track the depth of the walker/mutator; if the depth is
greater than 1 when we reach a Var (i.e. the Var is referenced inside some
other expression node), 'var->varattno - 1' is added to the bitmapset. Some
special nodes must be ignored; see increase_level_for_pre_detoast for
details.

2. We also need to make sure that the detoasted datum does not increase the
work_mem usage of nodes like Sort/Hash etc.; all such nodes can be found by
searching for the CP_SMALL_TLIST flag. If a node sits under such a
Sort/Hash-like node, we do some extra checking to see whether a Var is a
*direct input* of that node. If yes, we can't detoast it in advance;
otherwise we know the Var is discarded before it reaches those nodes, so we
can still use the shared detoast feature. The simplest cases to show this:

  Query-1
  explain (verbose) select * from t1 where b > 'a';
  -- b can be detoasted in advance.

  Query-2
  explain (verbose) select * from t1 where b > 'a' order by c;
  -- b can't be pre-detoasted, since that would make the Sort use more
  -- work_mem.

  Query-3
  explain (verbose) select a, c from t1 where b > 'a' order by c;
  -- b can be pre-detoasted, since it is discarded before going to the Sort
  -- node. In this case pre-detoasting doesn't buy anything, but for a more
  -- complex case like Query-4 it does.
  Query-4
  explain (costs off, verbose)
  select t3.* from t1, t2, t3
  where t2.c > '999999999999999' and t2.c = t1.c and t3.b = t1.b;

                           QUERY PLAN
  --------------------------------------------------------------
   Hash Join
     Output: t3.a, t3.b, t3.c
     Hash Cond: (t3.b = t1.b)
     ->  Seq Scan on public.t3
           Output: t3.a, t3.b, t3.c
     ->  Hash
           Output: t1.b
           ->  Nested Loop
                 Output: t1.b                             <-- Note here 3
                 ->  Seq Scan on public.t2
                       Output: t2.a, t2.b, t2.c
                       Filter: (t2.c > '9999...'::text)   <-- Note here 1
                 ->  Index Scan using t1_c_idx on public.t1
                       Output: t1.a, t1.b, t1.c
                       Index Cond: (t1.c = t2.c)          <-- Note here 2
  (15 rows)

In this case the detoasted datum for t2.c can be shared; it benefits
"t2.c = t1.c" and does no harm to the Hash node.

Execution side
--------------
Once we decide a Var should be pre-detoasted for a given plan node, a
special step named EEOP_{INNER|OUTER|SCAN}_VAR_TOAST is created during the
ExecInitExpr stage. These special steps are introduced to avoid any impact
on the unrelated EEOP_{INNER|OUTER|SCAN}_VAR code paths.

slot->tts_values is used to store the detoasted datum so that any other
expression can access it easily. Because of this design, the detoasted datum
must have the same lifespan as any other slot->tts_values[*], so the default
ecxt_per_tuple_memory is not suitable for it. In the end I used
slot->tts_mcxt to hold the memory, and maintain its lifecycle in
execTuples.c. To know which datums in slot->tts_values are pre-detoasted,
"Bitmapset *slot->pre_detoasted_attrs" is introduced.

During the execution of these steps, code like the below is used:

static inline void
ExecSlotDetoastDatum(TupleTableSlot *slot, int attnum)
{
	if (!slot->tts_isnull[attnum] &&
		VARATT_IS_EXTENDED(slot->tts_values[attnum]))
	{
		Datum		oldDatum;
		MemoryContext old = MemoryContextSwitchTo(slot->tts_mcxt);

		oldDatum = slot->tts_values[attnum];
		slot->tts_values[attnum] = PointerGetDatum(detoast_attr(
											(struct varlena *) oldDatum));
		Assert(slot->tts_nvalid > attnum);
		if (oldDatum != slot->tts_values[attnum])
			slot->pre_detoasted_attrs = bms_add_member(slot->pre_detoasted_attrs,
													   attnum);
		MemoryContextSwitchTo(old);
	}
}

Testing
-------
- shared_detoast_slow.sql is used to test the planner-related code changes.
  'set jit to off' will enable more INFO logs about which Var is
  pre-detoasted at which node level.

- Cases where the patch doesn't help much:

  create table w(a int, b numeric);
  insert into w select i, i from generate_series(1, 1000000)i;

  Q3: select a from w where a > 0;
  Q4: select b from w where b > 0;

  pgbench -n -f 1.sql postgres -T 10 -M prepared

  Run 5 times and take the average value.

  | Qry No |  Master | patched |  perf | comment                                 |
  |--------+---------+---------+-------+-----------------------------------------|
  |      3 | 309.438 | 308.411 |     0 | nearly zero impact on them              |
  |      4 | 431.735 | 420.833 | +2.6% | save the detoast effort for numeric_out |

- Good case setup:

  create table b(big jsonb);
  insert into b select
    jsonb_object_agg(x::text,
                     random()::text || random()::text || random()::text)
  from generate_series(1, 600) f(x);
  insert into b select (select big from b) from generate_series(1, 1000)i;

  Workload:

  Q1: select big->'1', big->'2', big->'3', big->'5', big->'10' from b;
  Q2: select 1 from b where length(big->>'1') > 0 and length(big->>'2') > 2;

  | No | Master | patched |
  |----+--------+---------|
  |  1 |   1677 |     357 |
  |  2 |    638 |     332 |

Some known issues:
------------------

1. Currently only Scan & Join nodes are considered for this feature.
2. JIT is not adapted for this purpose yet.
3.
   This feature builds a strong relationship between slot->tts_values and
   slot->pre_detoasted_attrs. For example, if slot->tts_values[1] holds a
   detoasted datum, then 1 is in the slot->pre_detoasted_attrs bitmapset.
   However, if someone changes the value directly, like
   "slot->tts_values[1] = 2;", without touching the slot's
   pre_detoasted_attrs, then trouble comes. The good thing is that the
   detoast can only happen in Scan/Join nodes, so it should be impossible
   for any other slot to have such an issue. I ran the following command to
   find such cases; none of them looks like a slot from a Scan/Join node.

   egrep -nri 'slot->tts_values\[[^\]]*\] = *

Any thoughts?

--
Best Regards
Andy Fan
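To make known issue 3 concrete: any code that writes slot->tts_values[*]
directly would have to keep pre_detoasted_attrs in sync, roughly along the
lines of the hypothetical helper below. This is an illustrative sketch only,
not part of the patch; it assumes the Bitmapset-based field shown above.

static inline void
ExecSlotOverwriteValue(TupleTableSlot *slot, int attnum,
					   Datum value, bool isnull)
{
	if (bms_is_member(attnum, slot->pre_detoasted_attrs))
	{
		/* release the detoasted copy the slot owns, then forget about it */
		pfree(DatumGetPointer(slot->tts_values[attnum]));
		slot->pre_detoasted_attrs = bms_del_member(slot->pre_detoasted_attrs,
												   attnum);
	}
	slot->tts_values[attnum] = value;
	slot->tts_isnull[attnum] = isnull;
}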
Attachment
Andy Fan <zhihuifan1213@163.com> writes:

> Some Known issues:
> ------------------
>
> 1. Currently only Scan & Join nodes are considered for this feature.
> 2. JIT is not adapted for this purpose yet.

JIT is adapted for this feature in v2. Any feedback is welcome.

--
Best Regards
Andy Fan
Attachment
On Mon, 1 Jan 2024 at 19:26, Andy Fan <zhihuifan1213@163.com> wrote: > > > Andy Fan <zhihuifan1213@163.com> writes: > > > > > Some Known issues: > > ------------------ > > > > 1. Currently only Scan & Join nodes are considered for this feature. > > 2. JIT is not adapted for this purpose yet. > > JIT is adapted for this feature in v2. Any feedback is welcome. One of the tests was aborted at CFBOT [1] with: [09:47:00.735] dumping /tmp/cores/postgres-11-28182.core for /tmp/cirrus-ci-build/build/tmp_install//usr/local/pgsql/bin/postgres [09:47:01.035] [New LWP 28182] [09:47:01.748] [Thread debugging using libthread_db enabled] [09:47:01.748] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". [09:47:09.392] Core was generated by `postgres: postgres regression [local] SELECT '. [09:47:09.392] Program terminated with signal SIGSEGV, Segmentation fault. [09:47:09.392] #0 0x00007fa4eed4a5a1 in ?? () [09:47:11.123] [09:47:11.123] Thread 1 (Thread 0x7fa4f8050a40 (LWP 28182)): [09:47:11.123] #0 0x00007fa4eed4a5a1 in ?? () [09:47:11.123] No symbol table info available. ... ... [09:47:11.123] #4 0x00007fa4ebc7a186 in LLVMOrcGetSymbolAddress () at /build/llvm-toolchain-11-HMpQvg/llvm-toolchain-11-11.0.1/llvm/lib/ExecutionEngine/Orc/OrcCBindings.cpp:124 [09:47:11.123] No locals. [09:47:11.123] #5 0x00007fa4eed6fc7a in llvm_get_function (context=0x564b1813a8a0, funcname=0x7fa4eed4a570 "AWAVATSH\201", <incomplete sequence \354\210>) at ../src/backend/jit/llvm/llvmjit.c:460 [09:47:11.123] addr = 94880527996960 [09:47:11.123] __func__ = "llvm_get_function" [09:47:11.123] #6 0x00007fa4eed902e1 in ExecRunCompiledExpr (state=0x0, econtext=0x564b18269d20, isNull=0x7ffc11054d5f) at ../src/backend/jit/llvm/llvmjit_expr.c:2577 [09:47:11.123] cstate = <optimized out> [09:47:11.123] func = 0x564b18269d20 [09:47:11.123] #7 0x0000564b1698e614 in ExecEvalExprSwitchContext (isNull=0x7ffc11054d5f, econtext=0x564b18269d20, state=0x564b182ad820) at ../src/include/executor/executor.h:355 [09:47:11.123] retDatum = <optimized out> [09:47:11.123] oldContext = 0x564b182680d0 [09:47:11.123] retDatum = <optimized out> [09:47:11.123] oldContext = <optimized out> [09:47:11.123] #8 ExecProject (projInfo=0x564b182ad818) at ../src/include/executor/executor.h:389 [09:47:11.123] econtext = 0x564b18269d20 [09:47:11.123] state = 0x564b182ad820 [09:47:11.123] slot = 0x564b182ad788 [09:47:11.123] isnull = false [09:47:11.123] econtext = <optimized out> [09:47:11.123] state = <optimized out> [09:47:11.123] slot = <optimized out> [09:47:11.123] isnull = <optimized out> [09:47:11.123] #9 ExecMergeJoin (pstate=<optimized out>) at ../src/backend/executor/nodeMergejoin.c:836 [09:47:11.123] node = <optimized out> [09:47:11.123] joinqual = 0x0 [09:47:11.123] otherqual = 0x0 [09:47:11.123] qualResult = <optimized out> [09:47:11.123] compareResult = <optimized out> [09:47:11.123] innerPlan = <optimized out> [09:47:11.123] innerTupleSlot = <optimized out> [09:47:11.123] outerPlan = <optimized out> [09:47:11.123] outerTupleSlot = <optimized out> [09:47:11.123] econtext = 0x564b18269d20 [09:47:11.123] doFillOuter = false [09:47:11.123] doFillInner = false [09:47:11.123] __func__ = "ExecMergeJoin" [09:47:11.123] #10 0x0000564b169275b9 in ExecProcNodeFirst (node=0x564b18269db0) at ../src/backend/executor/execProcnode.c:464 [09:47:11.123] No locals. [09:47:11.123] #11 0x0000564b169a2675 in ExecProcNode (node=0x564b18269db0) at ../src/include/executor/executor.h:273 [09:47:11.123] No locals. 
[09:47:11.123] #12 ExecRecursiveUnion (pstate=0x564b182684a0) at ../src/backend/executor/nodeRecursiveunion.c:115 [09:47:11.123] node = 0x564b182684a0 [09:47:11.123] outerPlan = 0x564b18268d00 [09:47:11.123] innerPlan = 0x564b18269db0 [09:47:11.123] plan = 0x564b182ddc78 [09:47:11.123] slot = <optimized out> [09:47:11.123] isnew = false [09:47:11.123] #13 0x0000564b1695a421 in ExecProcNode (node=0x564b182684a0) at ../src/include/executor/executor.h:273 [09:47:11.123] No locals. [09:47:11.123] #14 CteScanNext (node=0x564b183a6830) at ../src/backend/executor/nodeCtescan.c:103 [09:47:11.123] cteslot = <optimized out> [09:47:11.123] estate = <optimized out> [09:47:11.123] dir = ForwardScanDirection [09:47:11.123] forward = true [09:47:11.123] tuplestorestate = 0x564b183a5cd0 [09:47:11.123] eof_tuplestore = <optimized out> [09:47:11.123] slot = 0x564b183a6be0 [09:47:11.123] #15 0x0000564b1692e22b in ExecScanFetch (node=node@entry=0x564b183a6830, accessMtd=accessMtd@entry=0x564b1695a183 <CteScanNext>, recheckMtd=recheckMtd@entry=0x564b16959db3 <CteScanRecheck>) at ../src/backend/executor/execScan.c:132 [09:47:11.123] estate = <optimized out> [09:47:11.123] #16 0x0000564b1692e332 in ExecScan (node=0x564b183a6830, accessMtd=accessMtd@entry=0x564b1695a183 <CteScanNext>, recheckMtd=recheckMtd@entry=0x564b16959db3 <CteScanRecheck>) at ../src/backend/executor/execScan.c:181 [09:47:11.123] econtext = 0x564b183a6b50 [09:47:11.123] qual = 0x0 [09:47:11.123] projInfo = 0x0 [09:47:11.123] #17 0x0000564b1695a5ea in ExecCteScan (pstate=<optimized out>) at ../src/backend/executor/nodeCtescan.c:164 [09:47:11.123] node = <optimized out> [09:47:11.123] #18 0x0000564b169a81da in ExecProcNode (node=0x564b183a6830) at ../src/include/executor/executor.h:273 [09:47:11.123] No locals. [09:47:11.123] #19 ExecSort (pstate=0x564b183a6620) at ../src/backend/executor/nodeSort.c:149 [09:47:11.123] plannode = 0x564b182dfb78 [09:47:11.123] outerNode = 0x564b183a6830 [09:47:11.123] tupDesc = <optimized out> [09:47:11.123] tuplesortopts = <optimized out> [09:47:11.123] node = 0x564b183a6620 [09:47:11.123] estate = 0x564b182681d0 [09:47:11.123] dir = ForwardScanDirection [09:47:11.123] tuplesortstate = 0x564b18207ff0 [09:47:11.123] slot = <optimized out> [09:47:11.123] #20 0x0000564b169275b9 in ExecProcNodeFirst (node=0x564b183a6620) at ../src/backend/executor/execProcnode.c:464 [09:47:11.123] No locals. [09:47:11.123] #21 0x0000564b16913d01 in ExecProcNode (node=0x564b183a6620) at ../src/include/executor/executor.h:273 [09:47:11.123] No locals. 
[09:47:11.123] #22 ExecutePlan (estate=estate@entry=0x564b182681d0, planstate=0x564b183a6620, use_parallel_mode=<optimized out>, operation=operation@entry=CMD_SELECT, sendTuples=sendTuples@entry=true, numberTuples=numberTuples@entry=0, direction=ForwardScanDirection, dest=0x564b182e2728, execute_once=true) at ../src/backend/executor/execMain.c:1670 [09:47:11.123] slot = <optimized out> [09:47:11.123] current_tuple_count = 0 [09:47:11.123] #23 0x0000564b16914024 in standard_ExecutorRun (queryDesc=0x564b181ba200, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at ../src/backend/executor/execMain.c:365 [09:47:11.123] estate = 0x564b182681d0 [09:47:11.123] operation = CMD_SELECT [09:47:11.123] dest = 0x564b182e2728 [09:47:11.123] sendTuples = true [09:47:11.123] oldcontext = 0x564b181ba100 [09:47:11.123] __func__ = "standard_ExecutorRun" [09:47:11.123] #24 0x0000564b1691418f in ExecutorRun (queryDesc=queryDesc@entry=0x564b181ba200, direction=direction@entry=ForwardScanDirection, count=count@entry=0, execute_once=<optimized out>) at ../src/backend/executor/execMain.c:309 [09:47:11.123] No locals. [09:47:11.123] #25 0x0000564b16d208af in PortalRunSelect (portal=portal@entry=0x564b1817ae10, forward=forward@entry=true, count=0, count@entry=9223372036854775807, dest=dest@entry=0x564b182e2728) at ../src/backend/tcop/pquery.c:924 [09:47:11.123] queryDesc = 0x564b181ba200 [09:47:11.123] direction = <optimized out> [09:47:11.123] nprocessed = <optimized out> [09:47:11.123] __func__ = "PortalRunSelect" [09:47:11.123] #26 0x0000564b16d2405b in PortalRun (portal=portal@entry=0x564b1817ae10, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x564b182e2728, altdest=altdest@entry=0x564b182e2728, qc=0x7ffc110551e0) at ../src/backend/tcop/pquery.c:768 [09:47:11.123] _save_exception_stack = 0x7ffc11055290 [09:47:11.123] _save_context_stack = 0x0 [09:47:11.123] _local_sigjmp_buf = {{__jmpbuf = {1, 3431825231787999889, 94880528213800, 94880526221072, 94880526741008, 94880526221000, -3433879442176195951, -8991885832768699759}, __mask_was_saved = 0, __saved_mask = {__val = {140720594047343, 688, 94880526749216, 94880511205302, 1, 140720594047343, 94880526999808, 8, 94880526213088, 112, 179, 94880526221072, 94880526221048, 94880526221000, 94880508502828, 2}}}} [09:47:11.123] _do_rethrow = <optimized out> [09:47:11.123] result = <optimized out> [09:47:11.123] nprocessed = <optimized out> [09:47:11.123] saveTopTransactionResourceOwner = 0x564b18139ad8 [09:47:11.123] saveTopTransactionContext = 0x564b181229f0 [09:47:11.123] saveActivePortal = 0x0 [09:47:11.123] saveResourceOwner = 0x564b18139ad8 [09:47:11.123] savePortalContext = 0x0 [09:47:11.123] saveMemoryContext = 0x564b181229f0 [09:47:11.123] __func__ = "PortalRun" [09:47:11.123] #27 0x0000564b16d1d098 in exec_simple_query (query_string=query_string@entry=0x564b180fa0e0 "with recursive search_graph(f, t, label) as (\n\tselect * from graph0 g\n\tunion all\n\tselect g.*\n\tfrom graph0 g, search_graph sg\n\twhere g.f = sg.t\n) search depth first by f, t set seq\nselect * from search"...) 
at ../src/backend/tcop/postgres.c:1273 [09:47:11.123] cmdtaglen = 6 [09:47:11.123] snapshot_set = <optimized out> [09:47:11.123] per_parsetree_context = 0x0 [09:47:11.123] plantree_list = 0x564b182e26d8 [09:47:11.123] parsetree = 0x564b180fbec8 [09:47:11.123] commandTag = <optimized out> [09:47:11.123] qc = {commandTag = CMDTAG_UNKNOWN, nprocessed = 0} [09:47:11.123] querytree_list = <optimized out> [09:47:11.123] portal = 0x564b1817ae10 [09:47:11.123] receiver = 0x564b182e2728 [09:47:11.123] format = 0 [09:47:11.123] cmdtagname = <optimized out> [09:47:11.123] parsetree_item__state = {l = <optimized out>, i = <optimized out>} [09:47:11.123] dest = DestRemote [09:47:11.123] oldcontext = 0x564b181229f0 [09:47:11.123] parsetree_list = 0x564b180fbef8 [09:47:11.123] parsetree_item = 0x564b180fbf10 [09:47:11.123] save_log_statement_stats = false [09:47:11.123] was_logged = false [09:47:11.123] use_implicit_block = false [09:47:11.123] msec_str = "\004\000\000\000\000\000\000\000\346[\376\026KV\000\000pR\005\021\374\177\000\000\335\000\000\000\000\000\000" [09:47:11.123] __func__ = "exec_simple_query" [09:47:11.123] #28 0x0000564b16d1fe33 in PostgresMain (dbname=<optimized out>, username=<optimized out>) at ../src/backend/tcop/postgres.c:4653 [09:47:11.123] query_string = 0x564b180fa0e0 "with recursive search_graph(f, t, label) as (\n\tselect * from graph0 g\n\tunion all\n\tselect g.*\n\tfrom graph0 g, search_graph sg\n\twhere g.f = sg.t\n) search depth first by f, t set seq\nselect * from search"... [09:47:11.123] firstchar = <optimized out> [09:47:11.123] input_message = {data = 0x564b180fa0e0 "with recursive search_graph(f, t, label) as (\n\tselect * from graph0 g\n\tunion all\n\tselect g.*\n\tfrom graph0 g, search_graph sg\n\twhere g.f = sg.t\n) search depth first by f, t set seq\nselect * from search"..., len = 221, maxlen = 1024, cursor = 221} [09:47:11.123] local_sigjmp_buf = {{__jmpbuf = {94880520543032, -8991887800862096751, 0, 4, 140720594048004, 1, -3433879442247499119, -8991885813233991023}, __mask_was_saved = 1, __saved_mask = {__val = {4194304, 1, 140346553036196, 94880526187920, 15616, 15680, 94880508418872, 0, 94880526187920, 15616, 94880520537224, 4, 140720594048004, 1, 94880508502235, 1}}}} [09:47:11.123] send_ready_for_query = false [09:47:11.123] idle_in_transaction_timeout_enabled = false [09:47:11.123] idle_session_timeout_enabled = false [09:47:11.123] __func__ = "PostgresMain" [09:47:11.123] #29 0x0000564b16bcc4e4 in BackendRun (port=port@entry=0x564b18126f50) at ../src/backend/postmaster/postmaster.c:4464 [1] - https://cirrus-ci.com/task/4765094966460416 Regards, Vignesh
Hi,

vignesh C <vignesh21@gmail.com> writes:

> On Mon, 1 Jan 2024 at 19:26, Andy Fan <zhihuifan1213@163.com> wrote:
>>
>> Andy Fan <zhihuifan1213@163.com> writes:
>>
>> > Some Known issues:
>> > ------------------
>> >
>> > 1. Currently only Scan & Join nodes are considered for this feature.
>> > 2. JIT is not adapted for this purpose yet.
>>
>> JIT is adapted for this feature in v2. Any feedback is welcome.
>
> One of the tests was aborted at CFBOT [1] with:
> [09:47:00.735] dumping /tmp/cores/postgres-11-28182.core for
> /tmp/cirrus-ci-build/build/tmp_install//usr/local/pgsql/bin/postgres
> [09:47:01.035] [New LWP 28182]

There was a bug in the JIT part, here is the fix. Thanks for taking care of
this!

--
Best Regards
Andy Fan
Attachment
Andy Fan <zhihuifan1213@163.com> writes:

>> One of the tests was aborted at CFBOT [1] with:
>> [09:47:00.735] dumping /tmp/cores/postgres-11-28182.core for
>> /tmp/cirrus-ci-build/build/tmp_install//usr/local/pgsql/bin/postgres
>> [09:47:01.035] [New LWP 28182]
>
> There was a bug in the JIT part, here is the fix. Thanks for taking care of
> this!

Fixed a GCC warning in cirrus-ci, hope everything is fine now.

--
Best Regards
Andy Fan
Attachment
2024-01 Commitfest.

Hi, This patch has a CF status of "Needs Review" [1], but it seems there
were CFbot test failures last time it was run [2]. Please have a look and
post an updated version if necessary.

======
[1] https://commitfest.postgresql.org/46/4759/
[2] https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/4759

Kind Regards,
Peter Smith.
Hi,

Peter Smith <smithpb2250@gmail.com> writes:

> 2024-01 Commitfest.
>
> Hi, This patch has a CF status of "Needs Review" [1], but it seems
> there were CFbot test failures last time it was run [2]. Please have a
> look and post an updated version if necessary.
>
> ======
> [1] https://commitfest.postgresql.org/46/4759/
> [2] https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/4759

v5 attached, it should fix the above issue. This version also introduces a
data struct called Bitset, which has APIs similar to Bitmapset but with the
ability to reset all bits without recycling its allocated memory; this is
important for this feature.

commit 44754fb03accb0dec9710a962a334ee73eba3c49 (HEAD -> shared_detoast_value_v2)
Author: yizhi.fzh <yizhi.fzh@alibaba-inc.com>
Date:   Tue Jan 23 13:38:34 2024 +0800

    shared detoast feature.

commit 14a6eafef9ff4926b8b877d694de476657deee8a
Author: yizhi.fzh <yizhi.fzh@alibaba-inc.com>
Date:   Mon Jan 22 15:48:33 2024 +0800

    Introduce a Bitset data struct.

    While Bitmapset is designed for a variable length of bits, Bitset is
    designed for a fixed length of bits; the fixed length must be specified
    at bitset_init time and stays unchanged for the whole lifespan. Because
    of this, some operations on Bitset are simpler than on Bitmapset.

    bitset_clear unsets all the bits but keeps the allocated memory; this
    capability is impossible for Bitmapset for some solid reasons, and it
    is the main reason to add this data struct.

    Also, for the performance aspect, the Bitset functions remove some
    unlikely checks and use Asserts instead.

[1] https://postgr.es/m/CAApHDvpdp9LyAoMXvS7iCX-t3VonQM3fTWCmhconEvORrQ%2BZYA%40mail.gmail.com
[2] https://postgr.es/m/875xzqxbv5.fsf%40163.com

I didn't write a good commit message for commit 2; anyone interested in it
can see the first message in this thread for the explanation. I think anyone
whose customers use lots of jsonb can probably benefit from this. The
precondition is that the toasted value is accessed 1+ times, including by
the jsonb output function; I think that is not rare.

--
Best Regards
Andy Fan
Hi Andy,
It looks like v5 is missing in your mail. Could you please check and resend it?
Thanks,
Michael.
-- Michael Zhilin Postgres Professional +7(925)3366270 https://www.postgrespro.ru
Michael Zhilin <m.zhilin@postgrespro.ru> writes:

> Hi Andy,
>
> It looks like v5 is missing in your mail. Could you please check and
> resend it?

Ha, yes.. v5 is really attached this time.

commit eee0b2058f912d0d56282711c5d88bc0b1b75c2f (HEAD -> shared_detoast_value_v3)
Author: yizhi.fzh <yizhi.fzh@alibaba-inc.com>
Date:   Tue Jan 23 13:38:34 2024 +0800

    shared detoast feature.

    details at https://postgr.es/m/87il4jrk1l.fsf%40163.com

commit eeca405f5ae87e7d4e5496de989ac7b5173bcaa9
Author: yizhi.fzh <yizhi.fzh@alibaba-inc.com>
Date:   Mon Jan 22 15:48:33 2024 +0800

    Introduce a Bitset data struct.

    While Bitmapset is designed for a variable length of bits, Bitset is
    designed for a fixed length of bits; the fixed length must be specified
    at bitset_init time and stays unchanged for the whole lifespan. Because
    of this, some operations on Bitset are simpler than on Bitmapset.

    bitset_clear unsets all the bits but keeps the allocated memory; this
    capability is impossible for Bitmapset for some solid reasons, and it
    is the main reason to add this data struct.

    Also, for the performance aspect, the Bitset functions remove some
    unlikely checks and use Asserts instead.

[1] https://postgr.es/m/CAApHDvpdp9LyAoMXvS7iCX-t3VonQM3fTWCmhconEvORrQ%2BZYA%40mail.gmail.com
[2] https://postgr.es/m/875xzqxbv5.fsf%40163.com

As for the "Introduce a Bitset data struct." commit, the test coverage is
100% now, so it would be great if people could review that one first.

--
Best Regards
Andy Fan
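A rough sketch of the intended Bitset lifecycle described in the commit
message above, just to make the point about reuse concrete. This is
illustrative only: bitset_init, bitset_add_member, bitset_clear and
bitset_free are the names mentioned in this thread, and their exact
signatures are whatever the attached patch defines.

static void
bitset_lifecycle_sketch(TupleTableSlot *slot)
{
	/* allocate once, sized by the fixed number of attributes */
	Bitset	   *pre_detoasted = bitset_init(slot->tts_tupleDescriptor->natts);

	for (int i = 0; i < 3; i++)			/* pretend: three tuples */
	{
		/* an EEOP_*_VAR_TOAST step detoasted attribute 0 for this tuple */
		bitset_add_member(pre_detoasted, 0);

		/* ExecClearTuple time: unset every bit, but keep the allocation */
		bitset_clear(pre_detoasted);
	}

	/* only when the slot itself goes away */
	bitset_free(pre_detoasted);
}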
Attachment
Hi,

I did another round of self-review. Comments, variable names, and the order
of function definitions are improved so that it can be read as smoothly as
possible. v6 attached.

--
Best Regards
Andy Fan
Attachment
Hi,

I took a quick look at this thread/patch today, so let me share a couple of
initial thoughts. I may not have a particularly coherent/consistent opinion
on the patch or what would be a better way to do this yet, but perhaps it'll
start a discussion ...

The goal of the patch (as I understand it) is essentially to cache detoasted
values, so that the value does not need to be detoasted repeatedly in
different parts of the plan. I think that's a perfectly sensible and
worthwhile goal - detoasting is not cheap, and complex plans may easily
spend a lot of time on it.

That being said, the approach seems somewhat invasive, and touches parts I
wouldn't have expected to need a change to implement this. For example, I
certainly would not have guessed the patch to need changes in createplan.c
or setrefs.c.

Perhaps it really needs to do these things, but neither the thread nor the
comments are very enlightening as for why it's needed :-( In many cases I
can guess, but I'm not sure my guess is correct. And comments in code
generally describe what's happening locally / next line, not the bigger
picture & why it's happening.

IIUC we walk the plan to decide which Vars should be detoasted (and cached)
once, and which cases should not do that because it'd inflate the amount of
data we need to keep in a Sort, Hash etc. Not sure if there's a better way
to do this - it depends on what happens in the upper parts of the plan, so
we can't decide while building the paths. But maybe we could decide this
while transforming the paths into a plan? (I realize the JIT thread nearby
needs to do something like that in create_plan, and in that one I suggested
maybe walking the plan would be a better approach, so I may be contradicting
myself a little bit.)

In any case, set_plan_forbid_pre_detoast_vars_recurse should probably
explain the overall strategy / reasoning in a bit more detail. Maybe it's
somewhere in this thread, but that's not great for reviewers.

Similar for the setrefs.c changes. It seems a bit suspicious to piggy-back
the new code onto fix_scan_expr/fix_scan_list and similar code. Those
functions have a pretty clearly defined purpose, and I'm not sure we want to
extend them to also deal with this new thing. (FWIW I'd 100% have done it
this way if I hacked on a PoC of this, to make it work. But I'm not sure
it's the right solution.)

I don't know what to think about the Bitset - maybe it's necessary, but how
would I know? I don't have any way to measure the benefits, because the 0002
patch uses it right away. I think it should be done the other way around,
i.e. the patch should introduce the main feature first (using the
traditional Bitmapset), and then add Bitset on top of that. That way we
could easily measure the impact and see if it's useful.

On the whole, my biggest concern is memory usage & leaks. It's not difficult
to already have problems with large detoasted values, and if we start
keeping more of them, that may get worse. Or at least that's my intuition -
it can't really get better by keeping the values longer, right?

The other thing is the risk of leaks (in the sense of keeping detoasted
values longer than expected). I see the values are allocated in tts_mcxt,
and maybe that's the right solution - not sure.

FWIW while looking at the patch, I couldn't help but think about expanded
datums. There's similarity in what these two features do - keep detoasted
values for a while, so that we don't need to do the expensive processing if
we access them repeatedly.
Of course, expanded datums are not meant to be long-lived, while "shared detoasted values" are meant to exist (potentially) for the query duration. But maybe there's something we could learn from expanded datums? For example how the varlena pointer is leveraged to point to the expanded object. For example, what if we add a "TOAST cache" as a query-level hash table, and modify the detoasting to first check the hash table (with the TOAST pointer as a key)? It'd be fairly trivial to enforce a memory limit on the hash table, evict values from it, etc. And it wouldn't require any of the createplan/setrefs changes, I think ... regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
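To make the "TOAST cache" idea a bit more concrete, here is a minimal sketch
of what such a query-level hash table might look like. All the type and
function names below are invented purely for illustration; only the dynahash
calls (HASHCTL/hash_create) are the real API. Detoasting would then do a
hash_search() on the key before calling detoast_attr(), and the entries
would live in a per-query memory context with some eviction policy.

#include "postgres.h"
#include "utils/hsearch.h"

typedef struct ToastCacheKey
{
	Oid			toastrelid;		/* va_toastrelid from the TOAST pointer */
	Oid			valueid;		/* va_valueid from the TOAST pointer */
} ToastCacheKey;

typedef struct ToastCacheEntry
{
	ToastCacheKey key;			/* hash key, must be first */
	struct varlena *detoasted;	/* detoasted copy, allocated in hcxt */
} ToastCacheEntry;

static HTAB *
create_toast_cache(MemoryContext per_query_ctx)
{
	HASHCTL		ctl;

	ctl.keysize = sizeof(ToastCacheKey);
	ctl.entrysize = sizeof(ToastCacheEntry);
	ctl.hcxt = per_query_ctx;

	return hash_create("toast value cache", 128, &ctl,
					   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
}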
Tomas Vondra <tomas.vondra@enterprisedb.com> writes: Hi Tomas, > > I took a quick look on this thread/patch today, so let me share a couple > initial thoughts. I may not have a particularly coherent/consistent > opinion on the patch or what would be a better way to do this yet, but > perhaps it'll start a discussion ... Thank you for this! > > The goal of the patch (as I understand it) is essentially to cache > detoasted values, so that the value does not need to be detoasted > repeatedly in different parts of the plan. I think that's perfectly > sensible and worthwhile goal - detoasting is not cheap, and complex > plans may easily spend a lot of time on it. exactly. > > That being said, the approach seems somewhat invasive, and touching > parts I wouldn't have expected to need a change to implement this. For > example, I certainly would not have guessed the patch to need changes in > createplan.c or setrefs.c. > > Perhaps it really needs to do these things, but neither the thread nor > the comments are very enlightening as for why it's needed :-( In many > cases I can guess, but I'm not sure my guess is correct. And comments in > code generally describe what's happening locally / next line, not the > bigger picture & why it's happening. there were explaination at [1], but it probably is too high level. Writing a proper comments is challenging for me, but I am pretty happy to try more. At the end of this writing, I explained the data workflow, I am feeling that would be useful for reviewers. > IIUC we walk the plan to decide which Vars should be detoasted (and > cached) once, and which cases should not do that because it'd inflate > the amount of data we need to keep in a Sort, Hash etc. Exactly. > Not sure if > there's a better way to do this - it depends on what happens in the > upper parts of the plan, so we can't decide while building the paths. I'd say I did this intentionally. Deciding such things in paths will be more expensive than create_plan stage IMO. > But maybe we could decide this while transforming the paths into a plan? > (I realize the JIT thread nearby needs to do something like that in > create_plan, and in that one I suggested maybe walking the plan would be > a better approach, so I may be contradicting myself a little bit.). I think that's pretty similar what I'm doing now. Just that I did it *just after* the create_plan. This is because the create_plan doesn't transform the path to plan in the top->down manner all the time, the known exception is create_mergejoin_plan. so I have to walk just after the create_plan is done. In the create_mergejoin_plan, the Sort node is created *after* the subplan for the Sort is created. /* Recursively process the path tree, demanding the correct tlist result */ plan = create_plan_recurse(root, best_path, CP_EXACT_TLIST); + /* + * After the plan tree is built completed, we start to walk for which + * expressions should not used the shared-detoast feature. + */ + set_plan_forbid_pre_detoast_vars_recurse(plan, NIL); > > In any case, set_plan_forbid_pre_detoast_vars_recurse should probably > explain the overall strategy / reasoning in a bit more detail. Maybe > it's somewhere in this thread, but that's not great for reviewers. a lession learnt, thanks. a revisted version of comments from the lastest patch. 
/* * set_plan_forbid_pre_detoast_vars_recurse * Walking the Plan tree in the top-down manner to gather the vars which * should be as small as possible and record them in Plan.forbid_pre_detoast_vars * * plan: the plan node to walk right now. * small_tlist: a list of nodes which its subplan should provide them as * small as possible. */ static void set_plan_forbid_pre_detoast_vars_recurse(Plan *plan, List *small_tlist) > > Similar for the setrefs.c changes. It seems a bit suspicious to piggy > back the new code into fix_scan_expr/fix_scan_list and similar code. > Those functions have a pretty clearly defined purpose, not sure we want > to also extend them to also deal with this new thing. (FWIW I'd 100%% > did it this way if I hacked on a PoC of this, to make it work. But I'm > not sure it's the right solution.) The main reason of doing so is because I want to share the same walk effort as fix_scan_expr. otherwise I have to walk the plan for every expression again. I thought this as a best practice in the past and thought we can treat the pre_detoast_attrs as a valuable side effects:( > I don't know what to thing about the Bitset - maybe it's necessary, but > how would I know? I don't have any way to measure the benefits, because > the 0002 patch uses it right away. a revisted version of comments from the latest patch. graph 2 explains this decision. /* * The attributes whose values are the detoasted version in tts_values[*], * if so these memory needs some extra clean-up. These memory can't be put * into ecxt_per_tuple_memory since many of them needs a longer life span, * for example the Datum in outer join. These memory is put into * TupleTableSlot.tts_mcxt and be clear whenever the tts_values[*] is * invalidated. * * Bitset rather than Bitmapset is chosen here because when all the members * of Bitmapset are deleted, the allocated memory will be deallocated * automatically, which is too expensive in this case since we need to * deleted all the members in each ExecClearTuple and repopulate it again * when fill the detoast datum to tts_values[*]. This situation will be run * again and again in an execution cycle. * * These values are populated by EEOP_{INNER/OUTER/SCAN}_VAR_TOAST steps. */ Bitset *pre_detoasted_attrs; > I think it should be done the other > way around, i.e. the patch should introduce the main feature first > (using the traditional Bitmapset), and then add Bitset on top of that. > That way we could easily measure the impact and see if it's useful. Acutally v4 used the Bitmapset, and then both perf and pgbench's tps indicate it is too expensive. and after talk with David at [2], I introduced bitset and use it here. the test case I used comes from [1]. IRCC, there were 5% performance difference because of this. create table w(a int, b numeric); insert into w select i, i from generate_series(1, 1000000)i; select b from w where b > 0; To reproduce the difference, we can replace the bitset_clear() with bitset_free(slot->pre_detoasted_attrs); slot->pre_detoasted_attrs = bitset_init(slot->tts_tupleDescriptor->natts); in ExecFreePreDetoastDatum. then it works same as Bitmapset. > On the whole, my biggest concern is memory usage & leaks. It's not > difficult to already have problems with large detoasted values, and if > we start keeping more of them, that may get worse. Or at least that's my > intuition - it can't really get better by keeping the values longer, right? > > The other thing is the risk of leaks (in the sense of keeping detoasted > values longer than expected). 
I see the values are allocated in > tts_mcxt, and maybe that's the right solution - not sure. about the memory usage, first it is kept as the same lifesplan as the tts_values[*] which can be released pretty quickly, only if the certain values of the tuples is not needed. it is true that we keep the detoast version longer than before, but that's something we have to pay I think. Leaks may happen since tts_mcxt is reset at the end of *executor*. So if we forget to release the memory when the tts_values[*] is invalidated somehow, the memory will be leaked until the end of executor. I think that will be enough to cause an issue. Currently besides I release such memory at the ExecClearTuple, I also relase such memory whenever we set tts_nvalid to 0, the theory used here is: /* * tts_values is treated invalidated since tts_nvalid is set to 0, so * let's free the pre-detoast datum. */ ExecFreePreDetoastDatum(slot); I will do more test on the memory leak stuff, since there are so many operation aginst slot like ExecCopySlot etc, I don't know how to test it fully. the method in my mind now is use TPCH with 10GB data size, and monitor the query runtime memory usage. > FWIW while looking at the patch, I couldn't help but to think about > expanded datums. There's similarity in what these two features do - keep > detoasted values for a while, so that we don't need to do the expensive > processing if we access them repeatedly. Could you provide some keyword or function names for the expanded datum here, I probably miss this. > Of course, expanded datums are > not meant to be long-lived, while "shared detoasted values" are meant to > exist (potentially) for the query duration. hmm, acutally the "shared detoast value" just live in the TupleTableSlot->tts_values[*], rather than the whole query duration. The simple case is: SELECT * FROM t WHERE a_text LIKE 'abc%'; when we scan to the next tuple, the detoast value for the previous tuple will be relased. > But maybe there's something > we could learn from expanded datums? For example how the varlena pointer > is leveraged to point to the expanded object. maybe. currently I just use detoast_attr to get the desired version. I'm pleasure if we have more effective way. if (!slot->tts_isnull[attnum] && VARATT_IS_EXTENDED(slot->tts_values[attnum])) { Datum oldDatum; MemoryContext old = MemoryContextSwitchTo(slot->tts_mcxt); oldDatum = slot->tts_values[attnum]; slot->tts_values[attnum] = PointerGetDatum(detoast_attr( (struct varlena *) oldDatum)); Assert(slot->tts_nvalid > attnum); Assert(oldDatum != slot->tts_values[attnum]); bitset_add_member(slot->pre_detoasted_attrs, attnum); MemoryContextSwitchTo(old); } > For example, what if we add a "TOAST cache" as a query-level hash table, > and modify the detoasting to first check the hash table (with the TOAST > pointer as a key)? It'd be fairly trivial to enforce a memory limit on > the hash table, evict values from it, etc. And it wouldn't require any > of the createplan/setrefs changes, I think ... Hmm, I am not sure I understand you correctly at this part. In the current patch, to avoid the run-time (ExecExprInterp) check if we should detoast and save the datum, I defined 3 extra steps so that the *extra check itself* is not needed for unnecessary attributes. for example an datum for int or a detoast datum should not be saved back to tts_values[*] due to the small_tlist reason. However these steps can be generated is based on the output of createplan/setrefs changes. 
take the INNER_VAR for example: In ExecInitExprRec: switch (variable->varno) { case INNER_VAR: if (is_join_plan(plan) && bms_is_member(attnum, ((JoinState *) state->parent)->inner_pre_detoast_attrs)) { scratch.opcode = EEOP_INNER_VAR_TOAST; } else { scratch.opcode = EEOP_INNER_VAR; } } The data workflow is: 1. set_plan_forbid_pre_detoast_vars_recurse (in the createplan.c) decides which Vars should *not* be pre_detoasted because of small_tlist reason and record it in Plan.forbid_pre_detoast_vars. 2. fix_scan_expr (in the setrefs.c) tracks which Vars should be detoasted for the specific plan node and record them in it. Currently only Scan and Join nodes support this feature. typedef struct Scan { ... /* * Records of var's varattno - 1 where the Var is accessed indirectly by * any expression, like a > 3. However a IS [NOT] NULL is not included * since it doesn't access the tts_values[*] at all. * * This is a essential information to figure out which attrs should use * the pre-detoast-attrs logic. */ Bitmapset *reference_attrs; } Scan; typedef struct Join { .. /* * Records of var's varattno - 1 where the Var is accessed indirectly by * any expression, like a > 3. However a IS [NOT] NULL is not included * since it doesn't access the tts_values[*] at all. * * This is a essential information to figure out which attrs should use * the pre-detoast-attrs logic. */ Bitmapset *outer_reference_attrs; Bitmapset *inner_reference_attrs; } Join; 3. during the InitPlan stage, we maintain the PlanState.xxx_pre_detoast_attrs and generated different StepOp for them. 4. At the ExecExprInterp stage, only the new StepOp do the extra check to see if the detoast should happen. Other steps doesn't need this check at all. If we avoid the createplan/setref.c changes, probabaly some unrelated StepOp needs the extra check as well? When I worked with the UniqueKey feature, I maintained a UniqueKey.README to summaried all the dicussed topics in threads, the README is designed to save the effort for more reviewer, I think I should apply the same logic for this feature. Thank you very much for your feedback! v7 attached, just some comments and Assert changes. [1] https://www.postgresql.org/message-id/87il4jrk1l.fsf%40163.com [2] https://www.postgresql.org/message-id/CAApHDvpdp9LyAoMXvS7iCX-t3VonQM3fTWCmhconEvORrQ%2BZYA%40mail.gmail.com -- Best Regards Andy Fan
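As a concrete picture of step 4 in the workflow above: the interpreter case
for the new step would presumably look like the existing EEOP_SCAN_VAR case
plus the detoast-and-save call shown upthread. This is my reading of what
the step has to do, not a quote of the patch code.

		EEO_CASE(EEOP_SCAN_VAR_TOAST)
		{
			int			attnum = op->d.var.attnum;

			/* detoast and write the result back into tts_values[attnum] */
			ExecSlotDetoastDatum(scanslot, attnum);

			*op->resvalue = scanslot->tts_values[attnum];
			*op->resnull = scanslot->tts_isnull[attnum];

			EEO_NEXT();
		}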
Attachment
On 2/20/24 19:38, Andy Fan wrote: > > ... > >> I think it should be done the other >> way around, i.e. the patch should introduce the main feature first >> (using the traditional Bitmapset), and then add Bitset on top of that. >> That way we could easily measure the impact and see if it's useful. > > Acutally v4 used the Bitmapset, and then both perf and pgbench's tps > indicate it is too expensive. and after talk with David at [2], I > introduced bitset and use it here. the test case I used comes from [1]. > IRCC, there were 5% performance difference because of this. > > create table w(a int, b numeric); > insert into w select i, i from generate_series(1, 1000000)i; > select b from w where b > 0; > > To reproduce the difference, we can replace the bitset_clear() with > > bitset_free(slot->pre_detoasted_attrs); > slot->pre_detoasted_attrs = bitset_init(slot->tts_tupleDescriptor->natts); > > in ExecFreePreDetoastDatum. then it works same as Bitmapset. > I understand the bitset was not introduced until v5, after noticing the bitmapset is not quite efficient. But I still think the patches should be the other way around, i.e. the main feature first, then the bitset as an optimization. That allows everyone to observe the improvement on their own (without having to tweak the code), and it also doesn't require commit of the bitset part before it gets actually used by anything. > >> On the whole, my biggest concern is memory usage & leaks. It's not >> difficult to already have problems with large detoasted values, and if >> we start keeping more of them, that may get worse. Or at least that's my >> intuition - it can't really get better by keeping the values longer, right? >> >> The other thing is the risk of leaks (in the sense of keeping detoasted >> values longer than expected). I see the values are allocated in >> tts_mcxt, and maybe that's the right solution - not sure. > > about the memory usage, first it is kept as the same lifesplan as the > tts_values[*] which can be released pretty quickly, only if the certain > values of the tuples is not needed. it is true that we keep the detoast > version longer than before, but that's something we have to pay I > think. > > Leaks may happen since tts_mcxt is reset at the end of *executor*. So if > we forget to release the memory when the tts_values[*] is invalidated > somehow, the memory will be leaked until the end of executor. I think > that will be enough to cause an issue. Currently besides I release such > memory at the ExecClearTuple, I also relase such memory whenever we set > tts_nvalid to 0, the theory used here is: > > /* > * tts_values is treated invalidated since tts_nvalid is set to 0, so > * let's free the pre-detoast datum. > */ > ExecFreePreDetoastDatum(slot); > > I will do more test on the memory leak stuff, since there are so many > operation aginst slot like ExecCopySlot etc, I don't know how to test it > fully. the method in my mind now is use TPCH with 10GB data size, and > monitor the query runtime memory usage. > I think this is exactly the "high level design" description that should be in a comment, somewhere. > >> FWIW while looking at the patch, I couldn't help but to think about >> expanded datums. There's similarity in what these two features do - keep >> detoasted values for a while, so that we don't need to do the expensive >> processing if we access them repeatedly. > > Could you provide some keyword or function names for the expanded datum > here, I probably miss this. 
> see src/include/utils/expandeddatum.h >> Of course, expanded datums are >> not meant to be long-lived, while "shared detoasted values" are meant to >> exist (potentially) for the query duration. > > hmm, acutally the "shared detoast value" just live in the > TupleTableSlot->tts_values[*], rather than the whole query duration. The > simple case is: > > SELECT * FROM t WHERE a_text LIKE 'abc%'; > > when we scan to the next tuple, the detoast value for the previous tuple > will be relased. > But if the (detoasted) value is passed to the next executor node, it'll be kept, right? >> But maybe there's something >> we could learn from expanded datums? For example how the varlena pointer >> is leveraged to point to the expanded object. > > maybe. currently I just use detoast_attr to get the desired version. I'm > pleasure if we have more effective way. > > if (!slot->tts_isnull[attnum] && > VARATT_IS_EXTENDED(slot->tts_values[attnum])) > { > Datum oldDatum; > MemoryContext old = MemoryContextSwitchTo(slot->tts_mcxt); > > oldDatum = slot->tts_values[attnum]; > slot->tts_values[attnum] = PointerGetDatum(detoast_attr( (struct varlena *) oldDatum)); > Assert(slot->tts_nvalid > attnum); > Assert(oldDatum != slot->tts_values[attnum]); > bitset_add_member(slot->pre_detoasted_attrs, attnum); > MemoryContextSwitchTo(old); > } > Right. FWIW I'm not sure if expanded objects are similar enough to be a useful inspiration. Unrelated question - could this actually end up being too aggressive? That is, could it detoast attributes that we end up not needing? For example, what if the tuple gets eliminated for some reason (e.g. because of a restriction on the table, or not having a match in a join)? Won't we detoast the tuple only to throw it away? > >> For example, what if we add a "TOAST cache" as a query-level hash table, >> and modify the detoasting to first check the hash table (with the TOAST >> pointer as a key)? It'd be fairly trivial to enforce a memory limit on >> the hash table, evict values from it, etc. And it wouldn't require any >> of the createplan/setrefs changes, I think ... > > Hmm, I am not sure I understand you correctly at this part. In the > current patch, to avoid the run-time (ExecExprInterp) check if we > should detoast and save the datum, I defined 3 extra steps so that > the *extra check itself* is not needed for unnecessary attributes. > for example an datum for int or a detoast datum should not be saved back > to tts_values[*] due to the small_tlist reason. However these steps can > be generated is based on the output of createplan/setrefs changes. take > the INNER_VAR for example: > > In ExecInitExprRec: > > switch (variable->varno) > { > case INNER_VAR: > if (is_join_plan(plan) && > bms_is_member(attnum, > ((JoinState *) state->parent)->inner_pre_detoast_attrs)) > { > scratch.opcode = EEOP_INNER_VAR_TOAST; > } > else > { > scratch.opcode = EEOP_INNER_VAR; > } > } > > The data workflow is: > > 1. set_plan_forbid_pre_detoast_vars_recurse (in the createplan.c) > decides which Vars should *not* be pre_detoasted because of small_tlist > reason and record it in Plan.forbid_pre_detoast_vars. > > 2. fix_scan_expr (in the setrefs.c) tracks which Vars should be > detoasted for the specific plan node and record them in it. Currently > only Scan and Join nodes support this feature. > > typedef struct Scan > { > ... > /* > * Records of var's varattno - 1 where the Var is accessed indirectly by > * any expression, like a > 3. 
However a IS [NOT] NULL is not included > * since it doesn't access the tts_values[*] at all. > * > * This is a essential information to figure out which attrs should use > * the pre-detoast-attrs logic. > */ > Bitmapset *reference_attrs; > } Scan; > > typedef struct Join > { > .. > /* > * Records of var's varattno - 1 where the Var is accessed indirectly by > * any expression, like a > 3. However a IS [NOT] NULL is not included > * since it doesn't access the tts_values[*] at all. > * > * This is a essential information to figure out which attrs should use > * the pre-detoast-attrs logic. > */ > Bitmapset *outer_reference_attrs; > Bitmapset *inner_reference_attrs; > } Join; > Is it actually necessary to add new fields to these nodes? Also, the names are not particularly descriptive of the purpose - it'd be better to have "detoast" in the name, instead of generic "reference". > 3. during the InitPlan stage, we maintain the > PlanState.xxx_pre_detoast_attrs and generated different StepOp for them. > > 4. At the ExecExprInterp stage, only the new StepOp do the extra check > to see if the detoast should happen. Other steps doesn't need this > check at all. > > If we avoid the createplan/setref.c changes, probabaly some unrelated > StepOp needs the extra check as well? > > When I worked with the UniqueKey feature, I maintained a > UniqueKey.README to summaried all the dicussed topics in threads, the > README is designed to save the effort for more reviewer, I think I > should apply the same logic for this feature. > Good idea. Either that (a separate README), or a comment in a header of some suitable .c/.h file (I prefer that, because that's kinda obvious when reading the code, I often not notice a README exists next to it). regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
I see your reply when I started to write a more high level document. Thanks for the step by step help! Tomas Vondra <tomas.vondra@enterprisedb.com> writes: > On 2/20/24 19:38, Andy Fan wrote: >> >> ... >> >>> I think it should be done the other >>> way around, i.e. the patch should introduce the main feature first >>> (using the traditional Bitmapset), and then add Bitset on top of that. >>> That way we could easily measure the impact and see if it's useful. >> >> Acutally v4 used the Bitmapset, and then both perf and pgbench's tps >> indicate it is too expensive. and after talk with David at [2], I >> introduced bitset and use it here. the test case I used comes from [1]. >> IRCC, there were 5% performance difference because of this. >> >> create table w(a int, b numeric); >> insert into w select i, i from generate_series(1, 1000000)i; >> select b from w where b > 0; >> >> To reproduce the difference, we can replace the bitset_clear() with >> >> bitset_free(slot->pre_detoasted_attrs); >> slot->pre_detoasted_attrs = bitset_init(slot->tts_tupleDescriptor->natts); >> >> in ExecFreePreDetoastDatum. then it works same as Bitmapset. >> > > I understand the bitset was not introduced until v5, after noticing the > bitmapset is not quite efficient. But I still think the patches should > be the other way around, i.e. the main feature first, then the bitset as > an optimization. > > That allows everyone to observe the improvement on their own (without > having to tweak the code), and it also doesn't require commit of the > bitset part before it gets actually used by anything. I start to think this is a better way rather than the opposite. The next version will be: 0001: shared detoast datum feature with high level introduction. 0002: introduce bitset and use it shared-detoast-datum feature, with the test case to show the improvement. >> I will do more test on the memory leak stuff, since there are so many >> operation aginst slot like ExecCopySlot etc, I don't know how to test it >> fully. the method in my mind now is use TPCH with 10GB data size, and >> monitor the query runtime memory usage. >> > > I think this is exactly the "high level design" description that should > be in a comment, somewhere. Got it. >>> Of course, expanded datums are >>> not meant to be long-lived, while "shared detoasted values" are meant to >>> exist (potentially) for the query duration. >> >> hmm, acutally the "shared detoast value" just live in the >> TupleTableSlot->tts_values[*], rather than the whole query duration. The >> simple case is: >> >> SELECT * FROM t WHERE a_text LIKE 'abc%'; >> >> when we scan to the next tuple, the detoast value for the previous tuple >> will be relased. >> > > But if the (detoasted) value is passed to the next executor node, it'll > be kept, right? Yes and only one copy for all the executor nodes. > Unrelated question - could this actually end up being too aggressive? > That is, could it detoast attributes that we end up not needing? For > example, what if the tuple gets eliminated for some reason (e.g. because > of a restriction on the table, or not having a match in a join)? Won't > we detoast the tuple only to throw it away? The detoast datum will have the exactly same lifespan with other tts_values[*]. If the tuple get eliminated for any reason, those detoast datum still exist until the slot is cleared for storing the next tuple. >> typedef struct Join >> { >> .. >> /* >> * Records of var's varattno - 1 where the Var is accessed indirectly by >> * any expression, like a > 3. 
>> * However a IS [NOT] NULL is not included
>> * since it doesn't access the tts_values[*] at all.
>> *
>> * This is essential information to figure out which attrs should use
>> * the pre-detoast-attrs logic.
>> */
>> Bitmapset *outer_reference_attrs;
>> Bitmapset *inner_reference_attrs;
>> } Join;
>>
>
> Is it actually necessary to add new fields to these nodes? Also, the
> names are not particularly descriptive of the purpose - it'd be better
> to have "detoast" in the name, instead of the generic "reference".

Because of the way the data is transformed, I think we must add these
fields to carry the information; it is then used to initialize the
corresponding information in PlanState. Maybe I have a fixed mindset,
but I can't think of a way to avoid that right now.

I used 'reference' rather than 'detoast' because of an implementation
issue: in createplan.c and setrefs.c I can't check atttyplen
effectively, so even a Var of int type is still recorded here, which may
have nothing to do with detoasting.

>>
>> When I worked on the UniqueKey feature, I maintained a
>> UniqueKey.README to summarize all the topics discussed in the threads;
>> the README is meant to save effort for reviewers. I think I
>> should apply the same approach to this feature.
>>
>
> Good idea. Either that (a separate README), or a comment in the header of
> some suitable .c/.h file (I prefer that, because it's kind of obvious
> when reading the code, whereas I often don't notice a README exists next
> to it).

Great, I'll try this starting tomorrow.

--
Best Regards
Andy Fan
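To make the atttyplen point above concrete: the narrowing to genuinely
toastable columns can only happen once the executor has a tuple
descriptor in hand. A minimal sketch of such a step is shown below; this
is my illustration, not code from the patch, and filter_toastable_attrs
is a hypothetical helper name.

#include "postgres.h"
#include "access/tupdesc.h"
#include "nodes/bitmapset.h"

/*
 * Hypothetical helper: keep only the members of the planner-provided
 * reference_attrs (0-based attnos) that can actually carry a toasted
 * value, i.e. non-dropped varlena columns.  Fixed-length types such as
 * int4 are filtered out here, which is why the planner-side field can
 * stay generic ("reference" rather than "detoast").
 */
static Bitmapset *
filter_toastable_attrs(Bitmapset *reference_attrs, TupleDesc tupdesc)
{
	Bitmapset  *result = NULL;
	int			attno = -1;

	while ((attno = bms_next_member(reference_attrs, attno)) >= 0)
	{
		Form_pg_attribute attr = TupleDescAttr(tupdesc, attno);

		if (attr->attlen == -1 && !attr->attisdropped)
			result = bms_add_member(result, attno);
	}

	return result;
}

With something along these lines run at executor-initialization time,
only varlena columns would ever be pre-detoasted, no matter what the
planner recorded.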
Hi,

>> Good idea. Either that (a separate README), or a comment in the header of
>> some suitable .c/.h file (I prefer that, because it's kind of obvious
>> when reading the code, whereas I often don't notice a README exists next
>> to it).
>
> Great, I'll try this starting tomorrow.

I have done it. I chose a README because this feature changes
createplan.c, setrefs.c, execExpr.c and execExprInterp.c, so putting the
high-level design into any single one of them looks inappropriate. So
the high-level design is here, and the detailed design for each step is
in the comments around the code. Hope this is helpful!

The problem:
------------

In the current expression engine, a toasted datum is detoasted when
required, but the result is discarded immediately, either by pfree'ing
it right away or leaving it for ResetExprContext. (Which of the two to
use is occasionally argued about.) The more serious problem is that
detoasting is expensive, especially for data types like jsonb or array,
where the value might be very large. In the example below, the
detoasting happens twice.

SELECT jb_col->'a', jb_col->'b' FROM t;

With shared-detoast-datum, we only need to detoast once per tuple, and
the result is discarded as soon as the tuple is not needed any more.

FWIW this issue may exist for small numeric and text values as well,
because of the short varlena format, where the header length uses 1 byte
rather than 4 bytes.

Current Design
--------------

The high-level design is to let createplan.c and setrefs.c decide which
Vars can use this feature, and then have the executor save the detoast
datum back into slot->tts_values[*] during the
EEOP_{INNER|OUTER|SCAN}_VAR_TOAST ExprEvalSteps. The reasons include:

- The existing expression engine reads datums from tts_values[*], so no
  extra work needs to be done there.

- It reuses the TupleTableSlot lifecycle to manage memory. It is natural
  to think of the detoast datum as a tts_value that simply happens to be
  in detoasted form. Since TupleTableSlot already has a clear lifecycle
  (ExecClearTuple, ExecCopySlot etc.), it is easy to reuse it for
  managing the detoast datum's memory.

- The existing projection code copies the detoasted datum (a
  pointer-sized Datum) to the next node's slot automatically, but the
  ownership stays unchanged, so only the slot where the detoast actually
  happened is in charge of its lifespan.

So the final data structure change is adding the field below to
TupleTableSlot.

typedef struct TupleTableSlot
{
	..

	/*
	 * The attributes whose values in tts_values[*] are detoasted
	 * versions; this memory needs some extra clean-up. It can't be put
	 * into ecxt_per_tuple_memory since it often needs a longer
	 * lifespan.
	 *
	 * This memory is allocated in TupleTableSlot.tts_mcxt and is
	 * cleared whenever tts_values[*] is invalidated.
	 */
	Bitmapset  *pre_detoasted_attrs;
};

Assume the decision of which Vars should use this feature has already
been made in createplan.c and setrefs.c. Three new ExprEvalSteps,
EEOP_{INNER,OUTER,SCAN}_VAR_TOAST, are used. While evaluating these
steps, the code below is used.
static inline void
ExecSlotDetoastDatum(TupleTableSlot *slot, int attnum)
{
	if (!slot->tts_isnull[attnum] &&
		VARATT_IS_EXTENDED(DatumGetPointer(slot->tts_values[attnum])))
	{
		Datum		oldDatum;
		MemoryContext old = MemoryContextSwitchTo(slot->tts_mcxt);

		oldDatum = slot->tts_values[attnum];
		slot->tts_values[attnum] = PointerGetDatum(
			detoast_attr((struct varlena *) DatumGetPointer(oldDatum)));
		Assert(slot->tts_nvalid > attnum);
		Assert(oldDatum != slot->tts_values[attnum]);
		slot->pre_detoasted_attrs = bms_add_member(slot->pre_detoasted_attrs,
												   attnum);
		MemoryContextSwitchTo(old);
	}
}

Since I don't want an extra run-time check to see whether a detoast
should happen, three new steps are introduced.

When to free the detoast datum? It depends on when the slot's
tts_values[*] is invalidated. ExecClearTuple is the obvious place, but
any TupleTableSlotOps callback which sets tts_nvalid = 0 tells us no one
will use the datums in tts_values[*] any more, so it is also time to
release them based on slot->pre_detoasted_attrs.

Now to the createplan.c/setrefs.c part, which decides which Vars should
use the shared detoast feature. The guidelines are:

1. The Var needs a detoast for some expression.

2. It should not break the CP_SMALL_TLIST design. Since we save the
detoast datum back into tts_values[*], the tuple becomes bigger; if we
did this blindly, it would be harmful to Sort/Hash style nodes.

The high-level data flow is:

1. In createplan.c we walk the plan tree to gather the CP_SMALL_TLIST
information caused by Sort/Hash style nodes and save it to
Plan.forbid_pre_detoast_vars via the function
set_plan_forbid_pre_detoast_vars_recurse.

2. In setrefs.c, fix_{scan|join}_expr recurses down to the Vars of each
expression, so it is a good place to track the attribute numbers and see
whether a Var is accessed directly or indirectly. Usually an indirect
access of a Var means a detoast would happen, for example in an
expression like a > 3. However some known expressions, like "Var IS
NULL", are ignored. The output is {Scan|Join}.xxx_reference_attrs.

As a result, the final information is added to the Scan and Join plan
nodes.

typedef struct Scan
{
	/*
	 * Records of var's varattno - 1 where the Var is accessed indirectly by
	 * any expression, like a > 3. However a IS [NOT] NULL is not included
	 * since it doesn't access the tts_values[*] at all.
	 *
	 * This is essential information to figure out which attrs should use
	 * the pre-detoast-attrs logic.
	 */
	Bitmapset  *reference_attrs;
} Scan;

typedef struct Join
{
	..

	/*
	 * Records of var's varattno - 1 where the Var is accessed indirectly by
	 * any expression, like a > 3. However a IS [NOT] NULL is not included
	 * since it doesn't access the tts_values[*] at all.
	 *
	 * This is essential information to figure out which attrs should use
	 * the pre-detoast-attrs logic.
	 */
	Bitmapset  *outer_reference_attrs;
	Bitmapset  *inner_reference_attrs;
} Join;

Note that I used '_reference_' rather than '_detoast_' here because at
this stage I still don't know whether the attribute is toastable; that
is only known at the MakeTupleTableSlot stage.

3. At the InitPlan stage, we calculate the final xxx_pre_detoast_attrs
in ScanState & JoinState. These are passed into the expression engine at
the ExecInitExprRec stage, the EEOP_{INNER|OUTER|SCAN}_VAR_TOAST steps
are generated, and finally everything is connected through
ExecSlotDetoastDatum.

Testing
-------

Case 1:

create table t (a numeric);
insert into t select i from generate_series(1, 100000)i;

cat 1.sql
select * from t where a > 0;

In this test, current master detoasts each datum twice:
once in numeric_gt and once in numeric_out. This feature makes it
detoast only once.

pgbench -f 1.sql -n postgres -T 10 -M prepared

master: 30.218 ms
patched (Bitmapset): 30.881 ms

Then we can see the perf report as below:

-    7.34%  0.00%  postgres  postgres  [.] ExecSlotDetoastDatum (inlined)
   - ExecSlotDetoastDatum (inlined)
      - 3.47% bms_add_member
         - 3.06% bms_make_singleton (inlined)
            - palloc0
                 1.30% AllocSetAlloc

-    5.99%  0.00%  postgres  postgres  [.] ExecFreePreDetoastDatum (inlined)
   - ExecFreePreDetoastDatum (inlined)
        2.64% bms_next_member
        1.17% bms_del_members
        0.94% AllocSetFree

One of the reasons is that Bitmapset deallocates its memory once all the
bits are deleted (due to commit 00b41463c), so we have to allocate
memory again the next time a member is added. This happens 100000 times
in the above workload.

So I introduced a bitset data structure (commit 0002), which is much
like Bitmapset but does not deallocate its memory when all the bits are
unset, and used it in this feature (commit 0003). The result then became
28.715 ms.

-    5.22%  0.00%  postgres  postgres  [.] ExecFreePreDetoastDatum (inlined)
   - ExecFreePreDetoastDatum (inlined)
      - 2.82% bitset_next_member
           1.69% bms_next_member_internal (inlined)
           0.95% bitset_next_member
        0.66% AllocSetFree

Here we can see the expensive calls are bitset_next_member on
slot->pre_detoasted_attrs and pfree. If we put the detoast datums into a
dedicated memory context, we can save the cost of bitset_next_member,
since we can discard all the memory at once with MemoryContextReset
instead of AllocSetFree (commit 0004). The result then became 27.840 ms.

So the final result for case 1:

master: 30.218 ms
patched (Bitmapset): 30.881 ms
patched (bitset): 28.715 ms
patched (bitset + tts_value_mctx): 27.840 ms

Big jsonb test:

create table b(big jsonb);
insert into b select
  jsonb_object_agg(x::text,
                   random()::text || random()::text || random()::text)
from generate_series(1, 600) f(x);
insert into b select (select big from b) from generate_series(1, 1000) i;

explain analyze
select big->'1', big->'2', big->'3', big->'5', big->'10' from b;

master: 702.224 ms
patched: 133.306 ms

Memory usage test:

I ran the TPC-H scale 10 workload against both the master and patched
versions; the memory usage looks stable.

Work in progress:

I'm still running TPC-H scale 100 to see if there is anything
interesting; that is in progress. As for scale 10:

master: 55s
patched: 56s

The difference is because the q9 plan changed a bit; finding the real
reason needs some more time. Since this patch doesn't affect best-path
generation, the plan change doesn't look reasonable to me yet.

--
Best Regards
Andy Fan
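For readers following the perf output above, here is a minimal sketch of
what the release path conceptually does in the Bitmapset-based variant,
assuming the pre_detoasted_attrs field added by the patch; the function
name is illustrative and this is not the patch's exact code.

#include "postgres.h"
#include "executor/tuptable.h"
#include "nodes/bitmapset.h"

/*
 * Illustrative only: free every pre-detoasted datum owned by the slot
 * and reset the tracking set.  Expected to run whenever the slot
 * invalidates tts_values[*] (e.g. ExecClearTuple setting tts_nvalid = 0).
 */
static inline void
FreePreDetoastDatumSketch(TupleTableSlot *slot)
{
	int			attnum = -1;

	while ((attnum = bms_next_member(slot->pre_detoasted_attrs, attnum)) >= 0)
		pfree(DatumGetPointer(slot->tts_values[attnum]));

	bms_free(slot->pre_detoasted_attrs);
	slot->pre_detoasted_attrs = NULL;
}

The bitset and dedicated-memory-context variants discussed above exist
precisely to avoid the bms_free()/pfree() costs visible in this loop.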
Hi!
I see this as a very promising initiative, but some issues come to mind.
When we store and detoast large values, say 1GB - a very likely scenario,
we have such cases from production systems - we would end up using a lot of
memory to keep these values alive, only to discard them later. Also, toasted values
are not always used immediately or as a whole, e.g. jsonb values are currently fully
detoasted (we're working on partial detoast right now) just to extract a single small value
from a big json, and such values are not worth keeping in memory. For text values too,
we often do not need the whole value to be detoasted and kept in memory.
What do you think?
Hi,

> When we store and detoast large values, say 1GB - a very likely scenario,
> we have such cases from production systems - we would end up using a lot of
> memory to keep these values alive, only to discard them later.

First, the feature makes sure that if the original expression would not
detoast a value, this feature will not detoast it either. Given that, if
there is a 1GB datum, in the past it also took 1GB of memory, just
possibly discarded sooner than in the patched version. But the patched
version will *not* keep it for too long either, since once
slot->tts_values[*] is invalidated, the memory is released immediately.
For example:

create table t(a text, b int);
insert into t select '1-gb-length-text' from generate_series(1, 100);

select * from t where a like '%gb%';

The patched version takes only 1GB of extra memory at any one time. Are
you worried about this case or some other cases?

> Also, toasted values
> are not always used immediately or as a whole, e.g. jsonb values are
> currently fully detoasted (we're working on partial detoast right now)
> just to extract a single small value from a big json, and such values
> are not worth keeping in memory. For text values too,
> we often do not need the whole value to be detoasted.

I'm not sure how often this is true, especially since you mentioned the
text data type. I can't imagine why people would fetch just a piece of a
text value, or how we could design a toasting system to handle such a
case without detoasting the whole value as the first step. For the jsonb
or array data types, I think it is more common. However, if we design
toasting like that, I'm not sure whether it would slow down other use
cases, e.g. detoasting the whole value. I'm just feeling that both
approaches have their use cases, a bit like heap store vs. column store.

FWIW, as I showed in the test case, this feature is not only helpful for
big datums, it is also helpful for small toasted datums.

--
Best Regards
Andy Fan
On 2/26/24 14:22, Andy Fan wrote: > >... > >> Also, toasted values >> are not always being used immediately and as a whole, i.e. jsonb values are fully >> detoasted (we're working on this right now) to extract the smallest value from >> big json, and these values are not worth keeping in memory. For text values too, >> we often do not need the whole value to be detoasted. > > I'm not sure how often this is true, espeically you metioned text data > type. I can't image why people acquire a piece of text and how can we > design a toasting system to fulfill such case without detoast the whole > as the first step. But for the jsonb or array data type, I think it is > more often. However if we design toasting like that, I'm not sure if it > would slow down other user case. for example detoasting the whole piece > use case. I am justing feeling both way has its user case, kind of heap > store and columna store. > Any substr/starts_with call benefits from this optimization, and such calls don't seem that uncommon. I certainly can imagine people doing this fairly often. In any case, it's a long-standing behavior / optimization, and I don't think we can just dismiss it this quickly. Is there a way to identify cases that are likely to benefit from this slicing, and disable the detoasting for them? We already disable it for other cases, so maybe we can do this here too? Or maybe there's a way to do the detoasting lazily, and only keep the detoasted value when it's requesting the whole value. Or perhaps even better, remember what part we detoasted, and maybe it'll be enough for future requests? I'm not sure how difficult would this be with the proposed approach, which eagerly detoasts the value into tts_values. I think it'd be easier to implement with the TOAST cache I briefly described ... regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi,
Tomas, we already have a working jsonb partial detoast prototype,
and currently I'm porting it to recent master. Due to the size
of the changes and their very invasive nature it takes a lot of effort,
but it is already done. I'm also trying to make the core patch
less invasive. Actually, it is a subject for a separate topic, as soon
as I make it work on the current master we'll propose it
to the community.
Andy, thank you! I'll check the last patch set out and reply in a day or two.
> On 2/26/24 14:22, Andy Fan wrote:
>>
>> ...
>>
>>> Also, toasted values
>>> are not always used immediately or as a whole, e.g. jsonb values are
>>> currently fully detoasted (we're working on partial detoast right now)
>>> just to extract a single small value from a big json, and such values
>>> are not worth keeping in memory. For text values too,
>>> we often do not need the whole value to be detoasted.
>>
>> I'm not sure how often this is true, especially since you mentioned the
>> text data type. I can't imagine why people would fetch just a piece of a
>> text value, or how we could design a toasting system to handle such a
>> case without detoasting the whole value as the first step. For the jsonb
>> or array data types, I think it is more common. However, if we design
>> toasting like that, I'm not sure whether it would slow down other use
>> cases, e.g. detoasting the whole value. I'm just feeling that both
>> approaches have their use cases, a bit like heap store vs. column store.
>>
>
> Any substr/starts_with call benefits from this optimization, and such
> calls don't seem that uncommon. I certainly can imagine people doing
> this fairly often.

This leads me to pay more attention to the pg_detoast_datum_slice use
case, which I had overlooked and which caused some regression here.
Thanks! As you said later:

> Is there a way to identify cases that are likely to benefit from this
> slicing, and disable the detoasting for them? We already disable it for
> other cases, so maybe we can do this here too?

I think the answer is yes, if our goal is to disable the pre-detoast for
some special calls like substr/starts_with. In the current patch, we
have:

/*
 * increase_level_for_pre_detoast
 *	Check if the given Expr could detoast a Var directly; if so, the
 *	walker level is increased (ctx->level_added is set accordingly).
 */
static inline void
increase_level_for_pre_detoast(Node *node, intermediate_level_context *ctx)
{
	/* The following nodes can't detoast a Var directly. */
	if (IsA(node, List) || IsA(node, TargetEntry) || IsA(node, NullTest))
	{
		ctx->level_added = false;
	}
	else if (IsA(node, FuncExpr) &&
			 castNode(FuncExpr, node)->funcid == F_PG_COLUMN_COMPRESSION)
	{
		/* let's not detoast first so that pg_column_compression works. */
		ctx->level_added = false;
	}

It works, but the blacklist approach doesn't look great.

> In any case, it's a long-standing behavior /
> optimization, and I don't think we can just dismiss it this quickly.

I agree. That's why I said both approaches have their use cases, and by
"both" I mean the other method is the "partial detoast prototype" which
Nikita has told me about before. While I am sure it has many use cases,
I also feel it is complex and I have many questions in my mind, but I'd
like to see the design before asking them.

> Or maybe there's a way to do the detoasting lazily, and only keep the
> detoasted value when it's requesting the whole value. Or perhaps even
> better, remember what part we detoasted, and maybe it'll be enough for
> future requests?
>
> I'm not sure how difficult would this be with the proposed approach,
> which eagerly detoasts the value into tts_values. I think it'd be
> easier to implement with the TOAST cache I briefly described ...

I can understand the benefits of the TOAST cache, but the following
issues don't look good to me, IIUC:

1). We can't put the result into tts_values[*], since without the
planner's decision we don't know whether this would break the
small_tlist logic. But if we put it into the cache only, a cache lookup
becomes an unavoidable overhead.

2). It is hard to decide which entry should be evicted without attaching
it to the TupleTableSlot's life-cycle; without that we can't guarantee
that the entry we keep is an entry we will reuse soon.

--
Best Regards
Andy Fan
Nikita Malakhov <hukutoc@gmail.com> writes:

> Hi,
>
> Tomas, we already have a working jsonb partial detoast prototype,
> and currently I'm porting it to recent master.

This is really awesome! Actually, when I talked to some MySQL folks,
they said MySQL already does this, and I admit it can solve some issues
that are different from the ones the current patch addresses. I just
have many implementation questions in my mind.

> Andy, thank you! I'll check the last patch set out and reply in a day
> or two.

Thank you for your attention!

--
Best Regards
Andy Fan
Hi, Andy!
Sorry for the delay, I have had long flights this week.
I've reviewed the patch set, thank you for your efforts.
I have several notes about the patch set code, but first of all
I'm not sure the overall approach is the best for the task.
As Tomas wrote above, the approach is very invasive
and spreads code related to detoasting among many
parts of code.
Have you considered another one - to alter pg_detoast_datum
(actually, it would be detoast_attr function) and save
detoasted datums in the detoast context derived
from the query context?
We have just enough information at this step to identify
the datum - toast relation id and value id, and could
keep links to these detoasted values in a, say, linked list
or hash table. Thus we would avoid altering the executor
code and all detoast-related code would reside within
the detoast source files?
I'll check this approach in the next several days and
report on the results here.
There are also comments on the code itself, I'd write them
a bit later.
Hi Nikita,

>
> Have you considered another one - to alter pg_detoast_datum
> (actually, it would be detoast_attr function) and save
> detoasted datums in the detoast context derived
> from the query context?
>
> We have just enough information at this step to identify
> the datum - toast relation id and value id, and could
> keep links to these detoasted values in a, say, linked list
> or hash table. Thus we would avoid altering the executor
> code and all detoast-related code would reside within
> the detoast source files?

I think you are talking about the approach Tomas suggested. I'm a bit
afraid of coming across as too opinionated, but I do have some concerns
about this approach, as I stated in [1]. It looks like my concerns are
still not addressed, or perhaps the concerns themselves are unfounded,
which is entirely possible I think.

[1] https://www.postgresql.org/message-id/875xyb1a6q.fsf%40163.com

--
Best Regards
Andy Fan
Hi,

Here is an updated version; the main changes are:

1. a shared_detoast_datum.org file which shows the latest design and the
   items still pending discussion.
2. I removed slot->pre_detoasted_attrs entirely.
3. Handle some pg_detoast_datum_slice use cases.
4. Some implementation improvements.

commit 66c64c197a5dab97a563be5a291127e4c5d6841d (HEAD -> shared_detoast_value)
Author: yizhi.fzh <yizhi.fzh@alibaba-inc.com>
Date:   Sun Mar 3 13:48:25 2024 +0800

    shared detoast datum

    See the overall design & alternative design & testing in
    shared_detoast_datum.org

In shared_detoast_datum.org I added an "alternative design" section for
the TOAST cache idea.

--
Best Regards
Andy Fan
On 3/3/24 02:52, Andy Fan wrote: > > Hi Nikita, > >> >> Have you considered another one - to alter pg_detoast_datum >> (actually, it would be detoast_attr function) and save >> detoasted datums in the detoast context derived >> from the query context? >> >> We have just enough information at this step to identify >> the datum - toast relation id and value id, and could >> keep links to these detoasted values in a, say, linked list >> or hash table. Thus we would avoid altering the executor >> code and all detoast-related code would reside within >> the detoast source files? > > I think you are talking about the way Tomas provided. I am really > afraid that I was thought of too self-opinionated, but I do have some > concerns about this approch as I stated here [1], looks my concerns is > still not addressed, or the concerns itself are too absurd which is > really possible I think? > I'm not sure I understand your concerns. I can't speak for others, but I did not consider you and your proposals too self-opinionated. You did propose a solution that you consider the right one. That's fine. People will question that and suggest possible alternatives. That's fine too, it's why we have this list in the first place. FWIW I'm not somehow sure the approach I suggested is guaranteed to be better than "your" approach. Maybe there's some issue that I missed, for example. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2/26/24 16:29, Andy Fan wrote: > > ...> > I can understand the benefits of the TOAST cache, but the following > issues looks not good to me IIUC: > > 1). we can't put the result to tts_values[*] since without the planner > decision, we don't know if this will break small_tlist logic. But if we > put it into the cache only, which means a cache-lookup as a overhead is > unavoidable. True - if you're comparing having the detoasted value in tts_values[*] directly with having to do a lookup in a cache, then yes, there's a bit of an overhead. But I think from the discussion it's clear having to detoast the value into tts_values[*] has some weaknesses too, in particular: - It requires decisions which attributes to detoast eagerly, which is quite invasive (having to walk the plan, ...). - I'm sure there will be cases where we choose to not detoast, but it would be beneficial to detoast. - Detoasting just the initial slices does not seem compatible with this. IMHO the overhead of the cache lookup would be negligible compared to the repeated detoasting of the value (which is the current baseline). I somewhat doubt the difference compared to tts_values[*] will be even measurable. > > 2). It is hard to decide which entry should be evicted without attaching > it to the TupleTableSlot's life-cycle. then we can't grantee the entry > we keep is the entry we will reuse soon? > True. But is that really a problem? I imagined we'd set some sort of memory limit on the cache (work_mem?), and evict oldest entries. So the entries would eventually get evicted, and the memory limit would ensure we don't consume arbitrary amounts of memory. We could also add some "slot callback" but that seems unnecessary. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 3/3/24 07:10, Andy Fan wrote: > > Hi, > > Here is a updated version, the main changes are: > > 1. an shared_detoast_datum.org file which shows the latest desgin and > pending items during discussion. > 2. I removed the slot->pre_detoast_attrs totally. > 3. handle some pg_detoast_datum_slice use case. > 4. Some implementation improvement. > I only very briefly skimmed the patch, and I guess most of my earlier comments still apply. But I'm a bit surprised the patch needs to pass a MemoryContext to so many places as a function argument - that seems to go against how we work with memory contexts. Doubly so because it seems to only ever pass CurrentMemoryContext anyway. So what's that about? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Tomas Vondra <tomas.vondra@enterprisedb.com> writes: > On 3/3/24 02:52, Andy Fan wrote: >> >> Hi Nikita, >> >>> >>> Have you considered another one - to alter pg_detoast_datum >>> (actually, it would be detoast_attr function) and save >>> detoasted datums in the detoast context derived >>> from the query context? >>> >>> We have just enough information at this step to identify >>> the datum - toast relation id and value id, and could >>> keep links to these detoasted values in a, say, linked list >>> or hash table. Thus we would avoid altering the executor >>> code and all detoast-related code would reside within >>> the detoast source files? >> >> I think you are talking about the way Tomas provided. I am really >> afraid that I was thought of too self-opinionated, but I do have some >> concerns about this approch as I stated here [1], looks my concerns is >> still not addressed, or the concerns itself are too absurd which is >> really possible I think? >> > > I can't speak for others, but I did not consider you and your > proposals too self-opinionated. This would be really great to know:) > You did > propose a solution that you consider the right one. That's fine. People > will question that and suggest possible alternatives. That's fine too, > it's why we have this list in the first place. I agree. -- Best Regards Andy Fan
Tomas Vondra <tomas.vondra@enterprisedb.com> writes: > On 2/26/24 16:29, Andy Fan wrote: >> >> ...> >> I can understand the benefits of the TOAST cache, but the following >> issues looks not good to me IIUC: >> >> 1). we can't put the result to tts_values[*] since without the planner >> decision, we don't know if this will break small_tlist logic. But if we >> put it into the cache only, which means a cache-lookup as a overhead is >> unavoidable. > > True - if you're comparing having the detoasted value in tts_values[*] > directly with having to do a lookup in a cache, then yes, there's a bit > of an overhead. > > But I think from the discussion it's clear having to detoast the value > into tts_values[*] has some weaknesses too, in particular: > > - It requires decisions which attributes to detoast eagerly, which is > quite invasive (having to walk the plan, ...). > > - I'm sure there will be cases where we choose to not detoast, but it > would be beneficial to detoast. > > - Detoasting just the initial slices does not seem compatible with this. > > IMHO the overhead of the cache lookup would be negligible compared to > the repeated detoasting of the value (which is the current baseline). I > somewhat doubt the difference compared to tts_values[*] will be even > measurable. > >> >> 2). It is hard to decide which entry should be evicted without attaching >> it to the TupleTableSlot's life-cycle. then we can't grantee the entry >> we keep is the entry we will reuse soon? >> > > True. But is that really a problem? I imagined we'd set some sort of > memory limit on the cache (work_mem?), and evict oldest entries. So the > entries would eventually get evicted, and the memory limit would ensure > we don't consume arbitrary amounts of memory. > Here is a copy from the shared_detoast_datum.org in the patch. I am feeling about when / which entry to free is a key problem and run-time (detoast_attr) overhead vs createplan.c overhead is a small difference as well. the overhead I paid for createplan.c/setref.c looks not huge as well. """ A alternative design: toast cache --------------------------------- This method is provided by Tomas during the review process. IIUC, this method would maintain a local HTAB which map a toast datum to a detoast datum and the entry is maintained / used in detoast_attr function. Within this method, the overall design is pretty clear and the code modification can be controlled in toasting system only. I assumed that releasing all of the memory at the end of executor once is not an option since it may consumed too many memory. Then, when and which entry to release becomes a trouble for me. For example: QUERY PLAN ------------------------------ Nested Loop Join Filter: (t1.a = t2.a) -> Seq Scan on t1 -> Seq Scan on t2 (4 rows) In this case t1.a needs a longer lifespan than t2.a since it is in outer relation. Without the help from slot's life-cycle system, I can't think out a answer for the above question. Another difference between the 2 methods is my method have many modification on createplan.c/setref.c/execExpr.c/execExprInterp.c, but it can save some run-time effort like hash_search find / enter run-time in method 2 since I put them directly into tts_values[*]. I'm not sure the factor 2 makes some real measurable difference in real case, so my current concern mainly comes from factor 1. """ -- Best Regards Andy Fan
Tomas Vondra <tomas.vondra@enterprisedb.com> writes: > On 3/3/24 07:10, Andy Fan wrote: >> >> Hi, >> >> Here is a updated version, the main changes are: >> >> 1. an shared_detoast_datum.org file which shows the latest desgin and >> pending items during discussion. >> 2. I removed the slot->pre_detoast_attrs totally. >> 3. handle some pg_detoast_datum_slice use case. >> 4. Some implementation improvement. >> > > I only very briefly skimmed the patch, and I guess most of my earlier > comments still apply. Yes, the overall design is not changed. > But I'm a bit surprised the patch needs to pass a > MemoryContext to so many places as a function argument - that seems to > go against how we work with memory contexts. Doubly so because it seems > to only ever pass CurrentMemoryContext anyway. So what's that about? I think you are talking about the argument like this: /* ---------- - * detoast_attr - + * detoast_attr_ext - * * Public entry point to get back a toasted value from compression * or external storage. The result is always non-extended varlena form. * + * ctx: The memory context which the final value belongs to. + * * Note some callers assume that if the input is an EXTERNAL or COMPRESSED * datum, the result will be a pfree'able chunk. * ---------- */ +extern struct varlena * +detoast_attr_ext(struct varlena *attr, MemoryContext ctx) This is mainly because 'detoast_attr' will apply more memory than the "final detoast datum" , for example the code to scan toast relation needs some memory as well, and what I want is just keeping the memory for the final detoast datum so that other memory can be released sooner, so I added the function argument for that. -- Best Regards Andy Fan
On 3/4/24 02:29, Andy Fan wrote: > > Tomas Vondra <tomas.vondra@enterprisedb.com> writes: > >> On 3/3/24 07:10, Andy Fan wrote: >>> >>> Hi, >>> >>> Here is a updated version, the main changes are: >>> >>> 1. an shared_detoast_datum.org file which shows the latest desgin and >>> pending items during discussion. >>> 2. I removed the slot->pre_detoast_attrs totally. >>> 3. handle some pg_detoast_datum_slice use case. >>> 4. Some implementation improvement. >>> >> >> I only very briefly skimmed the patch, and I guess most of my earlier >> comments still apply. > > Yes, the overall design is not changed. > >> But I'm a bit surprised the patch needs to pass a >> MemoryContext to so many places as a function argument - that seems to >> go against how we work with memory contexts. Doubly so because it seems >> to only ever pass CurrentMemoryContext anyway. So what's that about? > > I think you are talking about the argument like this: > > /* ---------- > - * detoast_attr - > + * detoast_attr_ext - > * > * Public entry point to get back a toasted value from compression > * or external storage. The result is always non-extended varlena form. > * > + * ctx: The memory context which the final value belongs to. > + * > * Note some callers assume that if the input is an EXTERNAL or COMPRESSED > * datum, the result will be a pfree'able chunk. > * ---------- > */ > > +extern struct varlena * > +detoast_attr_ext(struct varlena *attr, MemoryContext ctx) > > This is mainly because 'detoast_attr' will apply more memory than the > "final detoast datum" , for example the code to scan toast relation > needs some memory as well, and what I want is just keeping the memory > for the final detoast datum so that other memory can be released sooner, > so I added the function argument for that. > What exactly does detoast_attr allocate in order to scan toast relation? Does that happen in master, or just with the patch? If with master, I suggest to ignore that / treat that as a separate issue and leave it for a different patch. In any case, the custom is to allocate results in the context that is set in CurrentMemoryContext at the moment of the call, and if there's substantial amount of allocations that we want to free soon, we either do that by explicit pfree() calls, or create a small temporary context in the function (but that's more expensive). I don't think we should invent a new convention where the context is passed as an argument, unless absolutely necessary. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
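To illustrate the convention being referred to here, a minimal sketch
follows; detoast_into_context is a hypothetical helper, not an existing
API. The caller switches CurrentMemoryContext around the call, so the
result, and also any scratch allocations made inside detoast_attr(),
land in the chosen context, which is exactly the trade-off under
discussion.

#include "postgres.h"
#include "access/detoast.h"
#include "utils/memutils.h"

/*
 * Sketch of the usual convention: allocate the detoast result in a
 * chosen context by switching CurrentMemoryContext around the call,
 * instead of passing the context down as a function argument.  Note
 * that scratch memory used while fetching toast chunks also lands in
 * 'target' this way.
 */
static struct varlena *
detoast_into_context(struct varlena *attr, MemoryContext target)
{
	MemoryContext oldcxt = MemoryContextSwitchTo(target);
	struct varlena *result = detoast_attr(attr);

	MemoryContextSwitchTo(oldcxt);
	return result;
}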
Tomas Vondra <tomas.vondra@enterprisedb.com> writes: >>> But I'm a bit surprised the patch needs to pass a >>> MemoryContext to so many places as a function argument - that seems to >>> go against how we work with memory contexts. Doubly so because it seems >>> to only ever pass CurrentMemoryContext anyway. So what's that about? >> >> I think you are talking about the argument like this: >> >> /* ---------- >> - * detoast_attr - >> + * detoast_attr_ext - >> * >> * Public entry point to get back a toasted value from compression >> * or external storage. The result is always non-extended varlena form. >> * >> + * ctx: The memory context which the final value belongs to. >> + * >> * Note some callers assume that if the input is an EXTERNAL or COMPRESSED >> * datum, the result will be a pfree'able chunk. >> * ---------- >> */ >> >> +extern struct varlena * >> +detoast_attr_ext(struct varlena *attr, MemoryContext ctx) >> >> This is mainly because 'detoast_attr' will apply more memory than the >> "final detoast datum" , for example the code to scan toast relation >> needs some memory as well, and what I want is just keeping the memory >> for the final detoast datum so that other memory can be released sooner, >> so I added the function argument for that. >> > > What exactly does detoast_attr allocate in order to scan toast relation? > Does that happen in master, or just with the patch? It is in the current master, for example the TupleTableSlot creation needed by scanning the toast relation needs a memory allocating. > If with master, I > suggest to ignore that / treat that as a separate issue and leave it for > a different patch. OK, I can make it as a seperate commit in the next version. > In any case, the custom is to allocate results in the context that is > set in CurrentMemoryContext at the moment of the call, and if there's > substantial amount of allocations that we want to free soon, we either > do that by explicit pfree() calls, or create a small temporary context > in the function (but that's more expensive). > > I don't think we should invent a new convention where the context is > passed as an argument, unless absolutely necessary. Hmm, in this case, if we don't add the new argument, we have to get the detoast datum in Context-1 and copy it to the desired memory context, which is the thing I want to avoid. I think we have a same decision to make on the TOAST cache method as well. -- Best Regards Andy Fan
On 3/4/24 02:23, Andy Fan wrote: > > Tomas Vondra <tomas.vondra@enterprisedb.com> writes: > >> On 2/26/24 16:29, Andy Fan wrote: >>> >>> ...> >>> I can understand the benefits of the TOAST cache, but the following >>> issues looks not good to me IIUC: >>> >>> 1). we can't put the result to tts_values[*] since without the planner >>> decision, we don't know if this will break small_tlist logic. But if we >>> put it into the cache only, which means a cache-lookup as a overhead is >>> unavoidable. >> >> True - if you're comparing having the detoasted value in tts_values[*] >> directly with having to do a lookup in a cache, then yes, there's a bit >> of an overhead. >> >> But I think from the discussion it's clear having to detoast the value >> into tts_values[*] has some weaknesses too, in particular: >> >> - It requires decisions which attributes to detoast eagerly, which is >> quite invasive (having to walk the plan, ...). >> >> - I'm sure there will be cases where we choose to not detoast, but it >> would be beneficial to detoast. >> >> - Detoasting just the initial slices does not seem compatible with this. >> >> IMHO the overhead of the cache lookup would be negligible compared to >> the repeated detoasting of the value (which is the current baseline). I >> somewhat doubt the difference compared to tts_values[*] will be even >> measurable. >> >>> >>> 2). It is hard to decide which entry should be evicted without attaching >>> it to the TupleTableSlot's life-cycle. then we can't grantee the entry >>> we keep is the entry we will reuse soon? >>> >> >> True. But is that really a problem? I imagined we'd set some sort of >> memory limit on the cache (work_mem?), and evict oldest entries. So the >> entries would eventually get evicted, and the memory limit would ensure >> we don't consume arbitrary amounts of memory. >> > > Here is a copy from the shared_detoast_datum.org in the patch. I am > feeling about when / which entry to free is a key problem and run-time > (detoast_attr) overhead vs createplan.c overhead is a small difference > as well. the overhead I paid for createplan.c/setref.c looks not huge as > well. I decided to whip up a PoC patch implementing this approach, to get a better idea of how it compares to the original proposal, both in terms of performance and code complexity. Attached are two patches that both add a simple cache in detoast.c, but differ in when exactly the caching happens. toast-cache-v1 - caching happens in toast_fetch_datum, which means this happens before decompression toast-cache-v2 - caching happens in detoast_attr, after decompression I started with v1, but that did not help with the example workloads (from the original post) at all. Only after I changed this to cache decompressed data (in v2) it became competitive. There's GUC to enable the cache (enable_toast_cache, off by default), to make experimentation easier. The cache is invalidated at the end of a transaction - I think this should be OK, because the rows may be deleted but can't be cleaned up (or replaced with a new TOAST value). This means the cache could cover multiple queries - not sure if that's a good thing or bad thing. I haven't implemented any sort of cache eviction. I agree that's a hard problem in general, but I doubt we can do better than LRU. I also didn't implement any memory limit. FWIW I think it's a fairly serious issue Andy's approach does not have any way to limit the memory. It will detoasts the values eagerly, but what if the rows have multiple large values? 
What if we end up keeping multiple such rows (from different parts of the plan) at once? I also haven't implemented caching for sliced data. I think modifying the code to support this should not be difficult - it'd require tracking how much data we have in the cache, and considering that during lookup and when adding entries to cache. I've implemented the cache as "global". Maybe it should be tied to query or executor state, but then it'd be harder to access it from detoast.c (we'd have to pass the cache pointer in some way, and it wouldn't work for use cases that don't go through executor). I think implementation-wise this is pretty non-invasive. Performance-wise, these patches are slower than with Andy's patch. For example for the jsonb Q1, I see master ~500ms, Andy's patch ~100ms and v2 at ~150ms. I'm sure there's a number of optimization opportunities and v2 could get v2 closer to the 100ms. v1 is not competitive at all in this jsonb/Q1 test - it's just as slow as master, because the data set is small enough to be fully cached, so there's no I/O and it's the decompression is the actual bottleneck. That being said, I'm not convinced v1 would be a bad choice for some cases. If there's memory pressure enough to evict TOAST, it's quite possible the I/O would become the bottleneck. At which point it might be a good trade off to cache compressed data, because then we'd cache more detoasted values. OTOH it's likely the access to TOAST values is localized (in temporal sense), i.e. we read it from disk, detoast it a couple times in a short time interval, and then move to the next row. That'd mean v2 is probably the correct trade off. Not sure. > > """ > A alternative design: toast cache > --------------------------------- > > This method is provided by Tomas during the review process. IIUC, this > method would maintain a local HTAB which map a toast datum to a detoast > datum and the entry is maintained / used in detoast_attr > function. Within this method, the overall design is pretty clear and the > code modification can be controlled in toasting system only. > Right. > I assumed that releasing all of the memory at the end of executor once > is not an option since it may consumed too many memory. Then, when and > which entry to release becomes a trouble for me. For example: > > QUERY PLAN > ------------------------------ > Nested Loop > Join Filter: (t1.a = t2.a) > -> Seq Scan on t1 > -> Seq Scan on t2 > (4 rows) > > In this case t1.a needs a longer lifespan than t2.a since it is > in outer relation. Without the help from slot's life-cycle system, I > can't think out a answer for the above question. > This is true, but how likely such plans are? I mean, surely no one would do nested loop with sequential scans on reasonably large tables, so how representative this example is? Also, this leads me to the need of having some sort of memory limit. If we may be keeping entries for extended periods of time, and we don't have any way to limit the amount of memory, that does not seem great. AFAIK if we detoast everything into tts_values[] there's no way to implement and enforce such limit. What happens if there's a row with multiple large-ish TOAST values? What happens if those rows are in different (and distant) parts of the plan? It seems far easier to limit the memory with the toast cache. 
> Another difference between the 2 methods is my method have many > modification on createplan.c/setref.c/execExpr.c/execExprInterp.c, but > it can save some run-time effort like hash_search find / enter run-time > in method 2 since I put them directly into tts_values[*]. > > I'm not sure the factor 2 makes some real measurable difference in real > case, so my current concern mainly comes from factor 1. > """ > This seems a bit dismissive of the (possible) issue. It'd be good to do some measurements, especially on simple queries that can't benefit from the detoasting (e.g. because there's no varlena attribute). In any case, my concern is more about having to do this when creating the plan at all, the code complexity etc. Not just because it might have performance impact. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
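As a rough illustration of where such a cache hooks in (a sketch under
assumptions, not the attached PoC; toast_cache_lookup and
toast_cache_insert are assumed helpers backed by a backend-local hash
table keyed by TOAST relation OID and value OID):

#include "postgres.h"
#include "access/detoast.h"
#include "varatt.h"

/* assumed helpers, e.g. backed by a dynahash table in a per-query context */
extern struct varlena *toast_cache_lookup(Oid toastrelid, Oid valueid);
extern void toast_cache_insert(Oid toastrelid, Oid valueid,
							   struct varlena *value);

/*
 * Sketch of a cached variant of detoast_attr(): only on-disk external
 * datums are cacheable, since they are the only ones with a stable
 * (toastrelid, valueid) identity.  Callers must not pfree a cached
 * result, which is one of the API questions such a cache raises.
 */
static struct varlena *
detoast_attr_cached(struct varlena *attr)
{
	if (VARATT_IS_EXTERNAL_ONDISK(attr))
	{
		struct varatt_external toast_pointer;
		struct varlena *result;

		VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);

		result = toast_cache_lookup(toast_pointer.va_toastrelid,
									toast_pointer.va_valueid);
		if (result != NULL)
			return result;

		result = detoast_attr(attr);	/* fetch + decompress as usual */
		toast_cache_insert(toast_pointer.va_toastrelid,
						   toast_pointer.va_valueid, result);
		return result;
	}

	return detoast_attr(attr);
}

Caching the result of detoast_attr() corresponds roughly to the v2
placement (after decompression); the v1 placement would instead hook
into toast_fetch_datum and cache the still-compressed bytes.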
> > I decided to whip up a PoC patch implementing this approach, to get a > better idea of how it compares to the original proposal, both in terms > of performance and code complexity. > > Attached are two patches that both add a simple cache in detoast.c, but > differ in when exactly the caching happens. > > toast-cache-v1 - caching happens in toast_fetch_datum, which means this > happens before decompression > > toast-cache-v2 - caching happens in detoast_attr, after decompression ... > > I think implementation-wise this is pretty non-invasive. .. > > Performance-wise, these patches are slower than with Andy's patch. For > example for the jsonb Q1, I see master ~500ms, Andy's patch ~100ms and > v2 at ~150ms. I'm sure there's a number of optimization opportunities > and v2 could get v2 closer to the 100ms. > > v1 is not competitive at all in this jsonb/Q1 test - it's just as slow > as master, because the data set is small enough to be fully cached, so > there's no I/O and it's the decompression is the actual bottleneck. > > That being said, I'm not convinced v1 would be a bad choice for some > cases. If there's memory pressure enough to evict TOAST, it's quite > possible the I/O would become the bottleneck. At which point it might be > a good trade off to cache compressed data, because then we'd cache more > detoasted values. > > OTOH it's likely the access to TOAST values is localized (in temporal > sense), i.e. we read it from disk, detoast it a couple times in a short > time interval, and then move to the next row. That'd mean v2 is probably > the correct trade off. I don't have different thought about the above statement and Thanks for the PoC code which would make later testing much easier. >> >> """ >> A alternative design: toast cache >> --------------------------------- >> >> This method is provided by Tomas during the review process. IIUC, this >> method would maintain a local HTAB which map a toast datum to a detoast >> datum and the entry is maintained / used in detoast_attr >> function. Within this method, the overall design is pretty clear and the >> code modification can be controlled in toasting system only. >> > > Right. > >> I assumed that releasing all of the memory at the end of executor once >> is not an option since it may consumed too many memory. Then, when and >> which entry to release becomes a trouble for me. For example: >> >> QUERY PLAN >> ------------------------------ >> Nested Loop >> Join Filter: (t1.a = t2.a) >> -> Seq Scan on t1 >> -> Seq Scan on t2 >> (4 rows) >> >> In this case t1.a needs a longer lifespan than t2.a since it is >> in outer relation. Without the help from slot's life-cycle system, I >> can't think out a answer for the above question. >> > > This is true, but how likely such plans are? I mean, surely no one would > do nested loop with sequential scans on reasonably large tables, so how > representative this example is? Acutally this is a simplest Join case, we still have same problem like Nested Loop + Index Scan which will be pretty common. > Also, this leads me to the need of having some sort of memory limit. If > we may be keeping entries for extended periods of time, and we don't > have any way to limit the amount of memory, that does not seem great. > > AFAIK if we detoast everything into tts_values[] there's no way to > implement and enforce such limit. What happens if there's a row with > multiple large-ish TOAST values? What happens if those rows are in > different (and distant) parts of the plan? 
I think this can be done by tracking the memory usage at the EState
level (or in a global variable), disabling the feature when the usage
exceeds the limit, and resuming it as we free detoast datums we no
longer need. I think no other changes need to be done.

> It seems far easier to limit the memory with the toast cache.

I think the memory limit and entry eviction are the key issues now. IMO
there is still a difference between the two methods even when both
support a memory limit. The reason is that my patch can guarantee the
cached memory will be reused: if we set the limit to 10MB, we know all
of the 10MB is useful. The TOAST cache method probably can't guarantee
that, so to make it effective we would have to set a higher limit.

>> Another difference between the 2 methods is my method have many
>> modification on createplan.c/setref.c/execExpr.c/execExprInterp.c, but
>> it can save some run-time effort like hash_search find / enter run-time
>> in method 2 since I put them directly into tts_values[*].
>>
>> I'm not sure the factor 2 makes some real measurable difference in real
>> case, so my current concern mainly comes from factor 1.
>> """
>>
>
> This seems a bit dismissive of the (possible) issue.

Hmm, sorry about that, that was not my intention :(

> It'd be good to do
> some measurements, especially on simple queries that can't benefit from
> the detoasting (e.g. because there's no varlena attribute).

Testing queries which have no varlena attribute was done at the very
beginning of this thread, and by now the code is much nicer for this
situation: none of the changes in execExpr.c / execExprInterp.c have any
impact on it, only the plan-walker code matters. Here is a test based on
tenk1.

q1: explain select count(*) from tenk1 where odd > 10 and even > 30;
q2: select count(*) from tenk1 where odd > 10 and even > 30;

pgbench -n -f qx.sql regression -T10

| Query | master (ms) | patched (ms) |
|-------+-------------+--------------|
| q1    |       0.129 |        0.129 |
| q2    |       1.876 |        1.870 |

There is some run-to-run variation in the patched q2 numbers, but at
least it shows the patch doesn't cause a noticeable regression.

> In any case, my concern is more about having to do this when creating
> the plan at all, the code complexity etc. Not just because it might have
> performance impact.

I think the main trade-off is that the TOAST cache method is pretty
non-invasive but can't control the eviction well. The impacts include:

1. It may evict the datum we want and keep datums we don't need.

2. It is more likely to use up all the memory that is allowed. For
example, if we set the limit to 1MB, we may keep data that will never be
used again and still consume the whole 1MB.

My method resolves this with some help from other modules (which is the
invasive part), but it can control the eviction well and uses as little
memory as possible.

--
Best Regards
Andy Fan
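A tiny sketch of the accounting described above; the struct and the
field names are purely illustrative, nothing like this exists in the
patch or in core today.

#include "postgres.h"

/* illustrative only: a per-query budget for pre-detoasted bytes */
typedef struct PreDetoastBudget
{
	Size		used;			/* bytes currently held in tts_values[*] */
	Size		limit;			/* configured cap, e.g. from a GUC */
} PreDetoastBudget;

/* skip pre-detoasting while the budget is exhausted ... */
static inline bool
pre_detoast_allowed(PreDetoastBudget *budget, Size datum_size)
{
	return budget->used + datum_size <= budget->limit;
}

/* ... and resume automatically as datums are freed with their slot */
static inline void
pre_detoast_release(PreDetoastBudget *budget, Size datum_size)
{
	Assert(budget->used >= datum_size);
	budget->used -= datum_size;
}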
On 3/4/24 18:08, Andy Fan wrote:
> ...
>>
>>> I assumed that releasing all of the memory at the end of executor once
>>> is not an option since it may consumed too many memory. Then, when and
>>> which entry to release becomes a trouble for me. For example:
>>>
>>>          QUERY PLAN
>>> ------------------------------
>>>  Nested Loop
>>>    Join Filter: (t1.a = t2.a)
>>>    ->  Seq Scan on t1
>>>    ->  Seq Scan on t2
>>> (4 rows)
>>>
>>> In this case t1.a needs a longer lifespan than t2.a since it is
>>> in outer relation. Without the help from slot's life-cycle system, I
>>> can't think out a answer for the above question.
>>
>> This is true, but how likely such plans are? I mean, surely no one would
>> do nested loop with sequential scans on reasonably large tables, so how
>> representative this example is?
>
> Acutally this is a simplest Join case, we still have same problem like
> Nested Loop + Index Scan which will be pretty common.
>

Yes, I understand there are cases where LRU eviction may not be the best
choice - like here, where the "t1" value should stay in the cache. But there
are also cases where this is the wrong choice, and LRU would be better.

For example a couple paragraphs down you suggest to enforce the memory limit
by disabling detoasting if the memory limit is reached. That means the
detoasting can get disabled because there's a single access to the attribute
somewhere "up the plan tree". But what if the other attributes (which now
won't be detoasted) are accessed many times until then?

I think LRU is a pretty good "default" algorithm if we don't have a very good
idea of the exact life span of the values, etc. Which is why other nodes
(e.g. Memoize) use LRU too.

But I wonder if there's a way to count how many times an attribute is
accessed (or is likely to be). That might be used to inform a better
eviction strategy.

Also, we don't need to evict the whole entry - we could evict just the data
part (guaranteed to be fairly large), but keep the header, and keep the
counts, expected number of hits, and other info. And use this to e.g. release
entries that reached the expected number of hits. But I'd probably start with
LRU and only do this as an improvement later.

>> Also, this leads me to the need of having some sort of memory limit. If
>> we may be keeping entries for extended periods of time, and we don't
>> have any way to limit the amount of memory, that does not seem great.
>>
>> AFAIK if we detoast everything into tts_values[] there's no way to
>> implement and enforce such limit. What happens if there's a row with
>> multiple large-ish TOAST values? What happens if those rows are in
>> different (and distant) parts of the plan?
>
> I think this can be done by tracking the memory usage on EState level
> or global variable level and disable it if it exceeds the limits and
> resume it when we free the detoast datum when we don't need it. I think
> no other changes need to be done.
>

That seems like a fair amount of additional complexity. And what if the
toasted values are accessed in a context without an EState (I haven't checked
how common / important that is)?

And I'm not sure about just disabling the detoast as a way to enforce a
memory limit, as explained earlier.

>> It seems far easier to limit the memory with the toast cache.
>
> I think the memory limit and entry eviction is the key issue now. IMO,
> there are still some difference when both methods can support memory
> limit. The reason is my patch can grantee the cached memory will be
> reused, so if we set the limit to 10MB, we know all the 10MB is
> useful, but the TOAST cache method, probably can't grantee that, then
> when we want to make it effecitvely, we have to set a higher limit for
> this.
>

Can it actually guarantee that? It can guarantee the slot may be used, but I
don't see how could it guarantee the detoasted value will be used. We may be
keeping the slot for other reasons. And even if it could guarantee the
detoasted value will be used, does that actually prove it's better to keep
that value? What if it's only used once, but it's blocking detoasting of
values that will be used 10x that?

If we detoast a 10MB value in the outer side of the Nest Loop, what if the
inner path has multiple accesses to another 10MB value that now can't be
detoasted (as a shared value)?

>>> Another difference between the 2 methods is my method have many
>>> modification on createplan.c/setref.c/execExpr.c/execExprInterp.c, but
>>> it can save some run-time effort like hash_search find / enter run-time
>>> in method 2 since I put them directly into tts_values[*].
>>>
>>> I'm not sure the factor 2 makes some real measurable difference in real
>>> case, so my current concern mainly comes from factor 1.
>>> """
>>
>> This seems a bit dismissive of the (possible) issue.
>
> Hmm, sorry about this, that is not my intention:(
>

No need to apologize.

>> It'd be good to do
>> some measurements, especially on simple queries that can't benefit from
>> the detoasting (e.g. because there's no varlena attribute).
>
> This testing for the queries which have no varlena attribute was done at
> the very begining of this thread, and for now, the code is much more
> nicer for this sistuation. all the changes in execExpr.c
> execExprInterp.c has no impact on this. Only the plan walker codes
> matter. Here is a test based on tenk1.
>
> q1: explain select count(*) from tenk1 where odd > 10 and even > 30;
> q2: select count(*) from tenk1 where odd > 10 and even > 30;
>
> pgbench -n -f qx.sql regression -T10
>
> | Query | master (ms) | patched (ms) |
> |-------+-------------+--------------|
> | q1    | 0.129       | 0.129        |
> | q2    | 1.876       | 1.870        |
>
> there are some error for patched-q2 combination, but at least it can
> show it doesn't cause noticable regression.
>

OK, I'll try to do some experiments with these cases too.

>> In any case, my concern is more about having to do this when creating
>> the plan at all, the code complexity etc. Not just because it might have
>> performance impact.
>
> I think the main trade-off is TOAST cache method is pretty non-invasive
> but can't control the eviction well, the impacts includes:
> 1. may evicting the datum we want and kept the datum we don't need.

This applies to any eviction algorithm, not just LRU. Ultimately what matters
is whether we have in the cache the most often used values, i.e. the hit
ratio (perhaps in combination with how expensive detoasting that particular
entry was).

> 2. more likely to use up all the memory which is allowed. for example:
> if we set the limit to 1MB, then we kept more data which will be not
> used and then consuming all of the 1MB.
>
> My method is resolving this with some helps from other modules (kind of
> invasive) but can control the eviction well and use the memory as less
> as possible.
>

Is the memory usage really an issue? Sure, it'd be nice to evict entries as
soon as we can prove they are not needed anymore, but if the cache limit is
set to 1MB it's not really a problem to use 1MB.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
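As a side note, a minimal sketch of what such a "keep the header, evict only
the payload" cache entry could look like - every name below is hypothetical
and not taken from either patch:

typedef struct ToastCacheKey
{
    Oid         toastrelid;     /* va_toastrelid from the TOAST pointer */
    Oid         valueid;        /* va_valueid from the TOAST pointer */
} ToastCacheKey;

typedef struct ToastCacheEntry
{
    ToastCacheKey key;
    int32       nhits;          /* accesses observed so far */
    int32       expected_hits;  /* estimate, if we ever have one */
    Size        data_len;       /* size of the detoasted data, 0 if evicted */
    struct varlena *data;       /* detoasted value, NULL once evicted */
} ToastCacheEntry;

/* "partial eviction": drop the large payload, keep the small bookkeeping */
static void
toast_cache_evict_data(ToastCacheEntry *entry)
{
    if (entry->data != NULL)
    {
        pfree(entry->data);
        entry->data = NULL;
        entry->data_len = 0;
    }
}

The counters stay around, so a later policy could compare nhits against
expected_hits before deciding what to evict next.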
Hi,
Tomas, sorry for confusion, in my previous message I meant exactly
the same approach you've posted above, and came up with almost
the same implementation.
Thank you very much for your attention to this thread!
I've asked Andy about this approach because of the same reasons you
mentioned - it keeps cache code small, localized and easy to maintain.
The question that worries me is the memory limit, and I think that cache
lookup and entry invalidation should be done in the toast_tuple_externalize
code, for the case where a previously detoasted value is updated
in the same query, like
UPDATE test SET value = value || '...';
I've added cache entry invalidation and data removal on delete and update
of the toasted values and am currently experimenting with large values.
On Tue, Mar 5, 2024 at 5:59 AM Andy Fan <zhihuifan1213@163.com> wrote:
>
>> 2. more likely to use up all the memory which is allowed. for example:
>> if we set the limit to 1MB, then we kept more data which will be not
>> used and then consuming all of the 1MB.
>>
>> My method is resolving this with some helps from other modules (kind of
>> invasive) but can control the eviction well and use the memory as less
>> as possible.
>>
>
> Is the memory usage really an issue? Sure, it'd be nice to evict entries
> as soon as we can prove they are not needed anymore, but if the cache
> limit is set to 1MB it's not really a problem to use 1MB.
This might be a key point which leads us to some different directions, so
I want to explain more about this, to see if we can get some agreement
here.
It is a bit hard to decide which memory limit to set: 1MB, 10MB, 40MB or
100MB. In my current case it is 40MB at least. A low memory limit
makes the cache ineffective, while a high memory limit causes a potential
memory use-up issue in the TOAST cache design. But in my method, even if we
set a higher value, memory is only used by the cases that really (nearly)
need it, and it would not cache extra values just because the limit has not
been hit yet. This makes a noticeable difference when we want to set a high
limit and we have many highly active sessions, like 100 * 40MB = 4GB.
> On 3/4/24 18:08, Andy Fan wrote:
>> ...
>>>
>>>> I assumed that releasing all of the memory at the end of executor once
>>>> is not an option since it may consumed too many memory. Then, when and
>>>> which entry to release becomes a trouble for me. For example:
>>>>
>>>> QUERY PLAN
>>>> ------------------------------
>>>> Nested Loop
>>>> Join Filter: (t1.a = t2.a)
>>>> -> Seq Scan on t1
>>>> -> Seq Scan on t2
>>>> (4 rows)
>>>>
>>>> In this case t1.a needs a longer lifespan than t2.a since it is
>>>> in outer relation. Without the help from slot's life-cycle system, I
>>>> can't think out a answer for the above question.
>>>>
>>>
>>> This is true, but how likely such plans are? I mean, surely no one would
>>> do nested loop with sequential scans on reasonably large tables, so how
>>> representative this example is?
>>
>> Acutally this is a simplest Join case, we still have same problem like
>> Nested Loop + Index Scan which will be pretty common.
>>
>
> Yes, I understand there are cases where LRU eviction may not be the best
> choice - like here, where the "t1" should stay in the case. But there
> are also cases where this is the wrong choice, and LRU would be better.
>
> For example a couple paragraphs down you suggest to enforce the memory
> limit by disabling detoasting if the memory limit is reached. That means
> the detoasting can get disabled because there's a single access to the
> attribute somewhere "up the plan tree". But what if the other attributes
> (which now won't be detoasted) are accessed many times until then?
I am not sure I follow here, but I want to explain more about
the disable-detoast-sharing logic when the memory limit is hit. When
this happens, detoast sharing is disabled, but since the detoast
datum will be released very soon when the slot->tts_values[*] is
discarded, the 'disabled' state turns back to 'enabled' quickly. So it is
not true that once it gets disabled, it can't be enabled any more for the
given query.
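For illustration, a minimal sketch of this disable/re-enable bookkeeping -
the names and the 40MB number are hypothetical, not from the actual patch:

static Size pre_detoast_bytes = 0;      /* detoasted bytes currently held */
#define PRE_DETOAST_LIMIT   (40 * 1024 * 1024)

/* checked before detoasting into tts_values[*] */
static inline bool
pre_detoast_allowed(Size request)
{
    return pre_detoast_bytes + request <= PRE_DETOAST_LIMIT;
}

static inline void
pre_detoast_account(Size nbytes)
{
    pre_detoast_bytes += nbytes;
}

/* called when the slot's tts_values[*] is invalidated and its memory reset */
static inline void
pre_detoast_release(Size nbytes)
{
    Assert(pre_detoast_bytes >= nbytes);
    pre_detoast_bytes -= nbytes;    /* sharing becomes possible again here */
}

Since the slot's detoast memory is released as soon as the slot content is
discarded, pre_detoast_release() runs frequently and the 'disabled' state is
short-lived.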
> I think LRU is a pretty good "default" algorithm if we don't have a very
> good idea of the exact life span of the values, etc. Which is why other
> nodes (e.g. Memoize) use LRU too.
> But I wonder if there's a way to count how many times an attribute is
> accessed (or is likely to be). That might be used to inform a better
> eviction strategy.
Yes, but for the current issue we can get a better estimation with the help
of the plan shape, and Memoize depends on some planner information as well.
If we bypass the planner information and try to resolve it at the
cache level, the code may become too complex as well, and all of that cost
is run-time overhead, while the other way is a planning-time overhead.
> Also, we don't need to evict the whole entry - we could evict just the
> data part (guaranteed to be fairly large), but keep the header, and keep
> the counts, expected number of hits, and other info. And use this to
> e.g. release entries that reached the expected number of hits. But I'd
> probably start with LRU and only do this as an improvement later.
A great lesson learnt here, thanks for sharing this!
As for the current use case, what I want to highlight is that here
we are "caching" "user data" "locally".
USER DATA indicates it might be very large; it is not common to have
1M tables, but it is quite common to have 1M tuples in one scan, so
keeping the headers might mean extra memory usage as well, like 10M * 24
bytes = 240MB. LOCALLY means it is not friendly to many active sessions.
CACHE indicates it is hard to evict correctly. My method also has the USER
DATA and LOCALLY attributes, but it would be better at eviction; eviction
in turn relates to the memory usage issue described at the beginning
of this message.
>>> Also, this leads me to the need of having some sort of memory limit. If
>>> we may be keeping entries for extended periods of time, and we don't
>>> have any way to limit the amount of memory, that does not seem great.
>>>
>>> AFAIK if we detoast everything into tts_values[] there's no way to
>>> implement and enforce such limit. What happens if there's a row with
>>> multiple large-ish TOAST values? What happens if those rows are in
>>> different (and distant) parts of the plan?
>>
>> I think this can be done by tracking the memory usage on EState level
>> or global variable level and disable it if it exceeds the limits and
>> resume it when we free the detoast datum when we don't need it. I think
>> no other changes need to be done.
>>
>
> That seems like a fair amount of additional complexity. And what if the
> toasted values are accessed in context without EState (I haven't checked
> how common / important that is)?
>
> And I'm not sure just disabling the detoast as a way to enforce a memory
> limit, as explained earlier.
>
>>> It seems far easier to limit the memory with the toast cache.
>>
>> I think the memory limit and entry eviction is the key issue now. IMO,
>> there are still some difference when both methods can support memory
>> limit. The reason is my patch can grantee the cached memory will be
>> reused, so if we set the limit to 10MB, we know all the 10MB is
>> useful, but the TOAST cache method, probably can't grantee that, then
>> when we want to make it effecitvely, we have to set a higher limit for
>> this.
>>
>
> Can it actually guarantee that? It can guarantee the slot may be used,
> but I don't see how could it guarantee the detoasted value will be used.
> We may be keeping the slot for other reasons. And even if it could
> guarantee the detoasted value will be used, does that actually prove
> it's better to keep that value? What if it's only used once, but it's
> blocking detoasting of values that will be used 10x that?
>
> If we detoast a 10MB value in the outer side of the Nest Loop, what if
> the inner path has multiple accesses to another 10MB value that now
> can't be detoasted (as a shared value)?
Guarantee may be the wrong word. The differences in my mind are:
1. The plan shape has better potential to know the use of a datum,
since we know the plan tree and the rows passed to a given node.
2. Planning-time effort is cheaper than run-time effort.
3. Eviction in my method is not as important as it is in the TOAST cache
method, since the memory is reset per slot, so it usually doesn't hit the
limit in fact. But as a cache, it does.
4. The TOAST cache case tends to use memory up to whatever limit we set.
>>> In any case, my concern is more about having to do this when creating
>>> the plan at all, the code complexity etc. Not just because it might have
>>> performance impact.
>>
>> I think the main trade-off is TOAST cache method is pretty non-invasive
>> but can't control the eviction well, the impacts includes:
>> 1. may evicting the datum we want and kept the datum we don't need.
>
> This applies to any eviction algorithm, not just LRU. Ultimately what
> matters is whether we have in the cache the most often used values, i.e.
> the hit ratio (perhaps in combination with how expensive detoasting that
> particular entry was).
Correct, just that I am doubtful about designing a LOCAL CACHE for USER
DATA, for the reasons I described above.
At last, thanks for your attention, I really appreciate it!
--
Best Regards
Andy Fan
Hi,
In addition to the previous message - for the toast_fetch_datum_slice
the first (and seemingly obvious) way is to detoast the whole value, save it
to the cache and serve slices from it on demand. I have another approach in
mind, but have to play with it first.
On 3/6/24 07:09, Nikita Malakhov wrote:
> Hi,
>
> In addition to the previous message - for the toast_fetch_datum_slice
> the first (seems obvious) way is to detoast the whole value, save it
> to cache and get slices from it on demand. I have another one on my
> mind, but have to play with it first.
>

What if you only need the first slice? In that case decoding everything will
be a clear regression.

IMHO this needs to work like this:

1) check if the cache already has sufficiently large slice detoasted (and
use the cached value if yes)

2) if there's no slice detoasted, or if it's too small, detoast a
sufficiently large slice and store it in the cache (possibly evicting the
already detoasted slice)

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
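For what it's worth, a small sketch of that 1)/2) logic, reusing the
hypothetical entry layout sketched earlier in the thread; only
detoast_attr_slice() is an existing function, the other helpers are
placeholders:

static struct varlena *
toast_cache_get_slice(struct varlena *attr, int32 sliceoffset, int32 slicelength)
{
    int32       needed = sliceoffset + slicelength;
    ToastCacheEntry *entry = toast_cache_lookup(attr);  /* hypothetical */

    /* 1) a sufficiently large prefix is already cached: serve from it */
    if (entry != NULL && entry->data_len >= (Size) needed)
        return copy_slice(entry->data, sliceoffset, slicelength);   /* hypothetical */

    /* 2) otherwise detoast a large-enough slice and remember it */
    entry = toast_cache_store(attr,                      /* hypothetical */
                              detoast_attr_slice(attr, 0, needed));
    return copy_slice(entry->data, sliceoffset, slicelength);
}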
On 3/5/24 10:03, Nikita Malakhov wrote:
> Hi,
>
> Tomas, sorry for confusion, in my previous message I meant exactly
> the same approach you've posted above, and came up with almost
> the same implementation.
>
> Thank you very much for your attention to this thread!
>
> I've asked Andy about this approach because of the same reasons you
> mentioned - it keeps cache code small, localized and easy to maintain.
>
> The question that worries me is the memory limit, and I think that cache
> lookup and entry invalidation should be done in the toast_tuple_externalize
> code, for the case where a previously detoasted value is updated
> in the same query, like
>

I haven't thought very much about when we need to invalidate entries and
where exactly should that happen. My patch is a very rough PoC that I wrote
in ~1h, and it's quite possible it will have to move or should be refactored
in some way.

But I don't quite understand the problem with this query:

> UPDATE test SET value = value || '...';

Surely the update creates a new TOAST value, with a completely new TOAST
pointer, new rows in the TOAST table etc. It's not updated in-place. So it'd
end up with two independent entries in the TOAST cache.

Or are you interested just in evicting the old value simply to free the
memory, because we're not going to need that (now invisible) value? That
seems interesting, but if it's too invasive or complex I'd just leave it up
to the LRU eviction (at least for v1).

I'm not sure why should anything happen in toast_tuple_externalize, that
seems like a very low-level piece of code. But I realized there's also
toast_delete_external, and maybe that's a good place to invalidate the
deleted TOAST values (which would also cover the UPDATE).

> I've added cache entry invalidation and data removal on delete and update
> of the toasted values and am currently experimenting with large values.
>

I haven't done anything about large values in the PoC patch, but I think it
might be a good idea to either ignore values that are too large (so that it
does not push everything else from the cache) or store them in compressed
form (assuming it's small enough, to at least save the I/O).

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
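A sketch of what that invalidation could look like, keyed by the TOAST
pointer itself; VARATT_EXTERNAL_GET_POINTER and struct varatt_external are
the existing definitions, toast_cache_remove is hypothetical:

static void
toast_cache_invalidate(Datum value)
{
    struct varlena *attr = (struct varlena *) DatumGetPointer(value);

    if (VARATT_IS_EXTERNAL_ONDISK(attr))
    {
        struct varatt_external toast_pointer;

        VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
        /* drop whatever was cached for this (toastrelid, valueid) pair */
        toast_cache_remove(toast_pointer.va_toastrelid,
                           toast_pointer.va_valueid);
    }
}

Calling something like this from the delete/update path would cover the
UPDATE case discussed above without touching the externalize code.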
Hi!
Tomas, I thought about this issue -
> What if you only need the first slice? In that case decoding everything
> will be a clear regression.
And I completely agree with you, I'm currently working on it and will post
a patch a bit later. Another issue we have here - if we have multiple
slices requested - then we have to update the cached slice from both sides,
the beginning and the end.
On update, yep, you're right
>Surely the update creates a new TOAST value, with a completely new TOAST
>pointer, new rows in the TOAST table etc. It's not updated in-place. So
>it'd end up with two independent entries in the TOAST cache.
>Or are you interested just in evicting the old value simply to free the
>memory, because we're not going to need that (now invisible) value? That
>seems interesting, but if it's too invasive or complex I'd just leave it
>up to the LRU eviction (at least for v1).
Again, yes, we do not need the old value after it was updated and
it is better to take care of it explicitly. It's a simple and not invasive
addition to your patch.
I can't agree with you about large values - it makes sense
to cache large values because the larger the value is, the more benefit
we get from not detoasting it again and again. One way
is to keep them compressed, but what if we have data with a very
low compression rate?
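One conceivable way to make that call is to look at the sizes already
recorded in the TOAST pointer and only keep the compressed form when it
actually saves enough memory - the 50% threshold below is arbitrary, and the
macros are the existing (PG14-era) ones:

static bool
worth_caching_compressed(struct varlena *attr)
{
    struct varatt_external toast_pointer;

    if (!VARATT_IS_EXTERNAL_ONDISK(attr))
        return false;

    VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);

    /* keep the compressed form only if it is at most half the raw size */
    return VARATT_EXTERNAL_IS_COMPRESSED(toast_pointer) &&
        VARATT_EXTERNAL_GET_EXTSIZE(toast_pointer) <
        (uint32) toast_pointer.va_rawsize / 2;
}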
I also changed the HASHACTION field value from HASH_ENTER to
HASH_ENTER_NULL to deal softly with the case when we do not
have enough memory, and added checks for whether the value was stored
or not.
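For reference, the pattern being described looks roughly like this (the hash
table and entry type are hypothetical; hash_search and HASH_ENTER_NULL are
the existing dynahash API). Note that, at least in current dynahash,
HASH_ENTER_NULL only returns NULL for tables with a non-default
(shared-memory) allocator - with the usual palloc-based allocator an
out-of-memory condition still raises an error - which is related to the
question raised in the reply below:

static void
toast_cache_try_store(HTAB *toast_cache_htab, ToastCacheKey *key,
                      struct varlena *value)
{
    bool        found;
    ToastCacheEntry *entry;

    entry = (ToastCacheEntry *) hash_search(toast_cache_htab, key,
                                            HASH_ENTER_NULL, &found);
    if (entry == NULL)
        return;             /* no memory: simply don't cache this value */

    if (!found)
    {
        /* initialize the non-key fields of a freshly created entry */
        entry->data = value;
        entry->data_len = VARSIZE_ANY(value);
        entry->nhits = 0;
    }
}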
On 3/7/24 08:33, Nikita Malakhov wrote: > Hi! > > Tomas, I thought about this issue - >> What if you only need the first slice? In that case decoding everything >> will be a clear regression. > And completely agree with you, I'm currently working on it and will post > a patch a bit later. Cool. I don't plan to work on improving my patch - it was merely a PoC, so if you're interested in working on that, that's good. > Another issue we have here - if we have multiple slices requested - > then we have to update cached slice from both sides, the beginning > and the end. > No opinion. I haven't thought about how to handle slices too much. > On update, yep, you're right >> Surely the update creates a new TOAST value, with a completely new TOAST >> pointer, new rows in the TOAST table etc. It's not updated in-place. So >> it'd end up with two independent entries in the TOAST cache. > >> Or are you interested just in evicting the old value simply to free the >> memory, because we're not going to need that (now invisible) value? That >> seems interesting, but if it's too invasive or complex I'd just leave it >> up to the LRU eviction (at least for v1). > Again, yes, we do not need the old value after it was updated and > it is better to take care of it explicitly. It's a simple and not invasive > addition to your patch. > OK > Could not agree with you about large values - it makes sense > to cache large values because the larger it is the more benefit > we will have from not detoasting it again and again. One way > is to keep them compressed, but what if we have data with very > low compression rate? > I'm not against caching large values, but I also think it's not difficult to construct cases where it's a losing strategy. > I also changed HASHACTION field value from HASH_ENTER to > HASH_ENTER_NULL to softly deal with case when we do not > have enough memory and added checks for if the value was stored > or not. > I'm not sure about this. HASH_ENTER_NULL is only being used in three very specific places (two are lock management, one is shmeminitstruct). This hash table is not unique in any way, so why not to use HASH_ENTER like 99% other hash tables? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi!
Here's a slightly improved version of patch Tomas provided above (v2),
with cache invalidations and slices caching added, still as PoC.
The main issue I've encountered during tests is that when the same query
retrieves both slices and the full value - slices, like substring(t,...) -
the order of the values is not predictable: with text fields the substrings
are retrieved before the full value, so we cannot benefit from caching the
full value first and serving the slices from the cached value; those slice
accesses still fall back to detoasting.
Tomas, about HASH_ENTER - according to comments it could throw
an OOM error, so I've changed it to HASH_ENTER_NULL to avoid
new errors. In this case we would just have the value not cached
without an error.
Attachment
Hi, Andy Fan asked me off-list for some feedback about this proposal. I have hesitated to comment on it for lack of having studied the matter in any detail, but since I've been asked for my input, here goes: I agree that there's a problem here, but I suspect this is not the right way to solve it. Let's compare this to something like the syscache. Specifically, let's think about the syscache on pg_class.relname. In the case of the syscache on pg_class.relname, there are two reasons why we might repeatedly look up the same values in the cache. One is that the code might do multiple name lookups when it really ought to do only one. Doing multiple lookups is bad for security and bad for performance, but we have code like that in some places and some of it is hard to fix. The other reason is that it is likely that the user will mention the same table names repeatedly; each time they do, they will trigger a lookup on the same name. By caching the result of the lookup, we can make it much faster. An important point to note here is that the queries that the user will issue in the future are unknown to us. In a certain sense, we cannot even know whether the same table name will be mentioned more than once in the same query: when we reach the first instance of the table name and look it up, we do not have any good way of knowing whether it will be mentioned again later, say due to a self-join. Hence, the pattern of cache lookups is fundamentally unknowable. But that's not the case in the motivating example that started this thread. In that case, the target list includes three separate references to the same column. We can therefore predict that if the column value is toasted, we're going to de-TOAST it exactly three times. If, on the other hand, the column value were mentioned only once, it would be detoasted just once. In that latter case, which is probably quite common, this whole cache idea seems like it ought to just waste cycles, which makes me suspect that no matter what is done to this patch, there will always be cases where it causes significant regressions. In the former case, where the column reference is repeated, it will win, but it will also hold onto the detoasted value after the third instance of detoasting, even though there will never be any more cache hits. Here, unlike the syscache case, the cache is thrown away at the end of the query, so future queries can't benefit. Even if we could find a way to preserve the cache in some cases, it's not very likely to pay off. People are much more likely to hit the same table more than once than to hit the same values in the same table more than once in the same session. Suppose we had a design where, when a value got detoasted, the detoasted value went into the place where the toasted value had come from. Then this problem would be handled pretty much perfectly: the detoasted value would last until the scan advanced to the next row, and then it would be thrown out. So there would be no query-lifespan memory leak. Now, I don't really know how this would work, because the TOAST pointer is coming out of a slot and we can't necessarily write one attribute of any arbitrary slot type without a bunch of extra overhead, but maybe there's some way. If it could be done cheaply enough, it would gain all the benefits of this proposal a lot more cheaply and with fewer downsides. Alternatively, maybe there's a way to notice the multiple references and introduce some kind of intermediate slot or other holding area where the detoasted value can be stored. 
In other words, instead of computing

big_jsonb_col->'a', big_jsonb_col->'b', big_jsonb_col->'c'

Maybe we ought to be trying to compute this:

big_jsonb_col=detoast(big_jsonb_col)
big_jsonb_col->'a', big_jsonb_col->'b', big_jsonb_col->'c'

Or perhaps this:

tmp=detoast(big_jsonb_col)
tmp->'a', tmp->'b', tmp->'c'

In still other words, a query-lifespan cache for this information feels like
it's tackling the problem at the wrong level. I do suspect there may be query
shapes where the same value gets detoasted multiple times in ways that the
proposals I just made wouldn't necessarily catch, such as in the case of a
self-join. But I think those cases are rare. In most cases, repeated
detoasting is probably very localized, with the same exact TOAST pointer
being de-TOASTed a couple of times in quick succession, so a query-lifespan
cache seems like the wrong thing.

Also, in most cases, repeated detoasting is probably quite predictable: we
could infer it from static analysis of the query. So throwing ANY kind of
cache at the problem seems like the wrong approach unless the cache is
super-cheap, which doesn't seem to be the case with this proposal, because a
cache caters to cases where we can't know whether we're going to recompute
the same result, and here, that seems like something we could probably
figure out.

Because I don't see any way to avoid regressions in the common case where
detoasting isn't repeated, I think this patch is likely never going to be
committed no matter how much time anyone spends on it. But, to repeat the
disclaimer from the top, all of the above is just a relatively uneducated
opinion without any detailed study of the matter.

...Robert
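The "infer it from static analysis" part is fairly easy to picture: count
how many times each attribute is referenced by the expressions attached to a
plan node, and only consider special treatment for attributes referenced
more than once. A sketch using the existing pull_var_clause() helper - the
caller-side bookkeeping and the assumption that all Vars here belong to one
relation are mine:

static void
count_var_refs(Node *exprs, int *refcounts)     /* refcounts indexed by attno - 1 */
{
    List       *vars = pull_var_clause(exprs,
                                       PVC_RECURSE_AGGREGATES |
                                       PVC_RECURSE_WINDOWFUNCS |
                                       PVC_RECURSE_PLACEHOLDERS);
    ListCell   *lc;

    foreach(lc, vars)
    {
        Var        *var = (Var *) lfirst(lc);

        /* only ordinary user attributes of the relation in question */
        if (var->varattno > 0)
            refcounts[var->varattno - 1]++;
    }
    list_free(vars);
}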
Hi Robert, > Andy Fan asked me off-list for some feedback about this proposal. I > have hesitated to comment on it for lack of having studied the matter > in any detail, but since I've been asked for my input, here goes: Thanks for doing this! Since we have two totally different designs between Tomas and me and I think we both understand the two sides of each design, just that the following things are really puzzled to me. - which factors we should care more. - the possiblility to improve the design-A/B for its badness. But for the point 2, we probably needs some prefer design first. for now let's highlight that there are 2 totally different method to handle this problem. Let's call: design a: detoast the datum into slot->tts_values (and manage the memory with the lifespan of *slot*). From me. design b: detost the datum into a CACHE (and manage the memory with CACHE eviction and at released the end-of-query). From Tomas. Currently I pefer design a, I'm not sure if I want desgin a because a is really good or just because design a is my own design. So I will summarize the the current state for the easier discussion. I summarize them so that you don't need to go to the long discussion, and I summarize the each point with the link to the original post since I may misunderstand some points and the original post can be used as a double check. > I agree that there's a problem here, but I suspect this is not the > right way to solve it. Let's compare this to something like the > syscache. Specifically, let's think about the syscache on > pg_class.relname. OK. > In the case of the syscache on pg_class.relname, there are two reasons > why we might repeatedly look up the same values in the cache. One is > that the code might do multiple name lookups when it really ought to > do only one. Doing multiple lookups is bad for security and bad for > performance, but we have code like that in some places and some of it > is hard to fix. The other reason is that it is likely that the user > will mention the same table names repeatedly; each time they do, they > will trigger a lookup on the same name. By caching the result of the > lookup, we can make it much faster. An important point to note here is > that the queries that the user will issue in the future are unknown to > us. In a certain sense, we cannot even know whether the same table > name will be mentioned more than once in the same query: when we reach > the first instance of the table name and look it up, we do not have > any good way of knowing whether it will be mentioned again later, say > due to a self-join. Hence, the pattern of cache lookups is > fundamentally unknowable. True. > But that's not the case in the motivating example that started this > thread. In that case, the target list includes three separate > references to the same column. We can therefore predict that if the > column value is toasted, we're going to de-TOAST it exactly three > times. True. > If, on the other hand, the column value were mentioned only > once, it would be detoasted just once. In that latter case, which is > probably quite common, this whole cache idea seems like it ought to > just waste cycles, which makes me suspect that no matter what is done > to this patch, there will always be cases where it causes significant > regressions. I agree. > In the former case, where the column reference is > repeated, it will win, but it will also hold onto the detoasted value > after the third instance of detoasting, even though there will never > be any more cache hits. 
True. > Here, unlike the syscache case, the cache is > thrown away at the end of the query, so future queries can't benefit. > Even if we could find a way to preserve the cache in some cases, it's > not very likely to pay off. People are much more likely to hit the > same table more than once than to hit the same values in the same > table more than once in the same session. I agree. One more things I want to highlight it "syscache" is used for metadata and *detoast cache* is used for user data. user data is more likely bigger than metadata, so cache size control (hence eviction logic run more often) is more necessary in this case. The first graph in [1] (including the quota text) highlight this idea. > Suppose we had a design where, when a value got detoasted, the > detoasted value went into the place where the toasted value had come > from. Then this problem would be handled pretty much perfectly: the > detoasted value would last until the scan advanced to the next row, > and then it would be thrown out. So there would be no query-lifespan > memory leak. I think that is same idea as design a. > Now, I don't really know how this would work, because the > TOAST pointer is coming out of a slot and we can't necessarily write > one attribute of any arbitrary slot type without a bunch of extra > overhead, but maybe there's some way. I think the key points includes: 1. Where to put the *detoast value* so that ExprEval system can find out it cheaply. The soluation in A slot->tts_values[*]. 2. How to manage the memory for holding the detoast value. The soluation in A is: - using a lazy created MemoryContext in slot. - let the detoast system detoast the datum into the designed MemoryContext. 3. How to let Expr evalution run the extra cycles when needed. The soluation in A is: EEOP_XXX_TOAST step is designed to get riding of some *run-time* *if (condition)" . I found the existing ExprEval system have many specialized steps. I will post a detailed overall design later and explain the its known badness so far. > If it could be done cheaply enough, it would gain all the benefits of > this proposal a lot more cheaply and with fewer > downsides. I agree. > Alternatively, maybe there's a way > to notice the multiple references and introduce some kind of > intermediate slot or other holding area where the detoasted value can > be stored. > > In other words, instead of computing > > big_jsonb_col->'a', big_jsonb_col->'b', big_jsonb_col->'c' > > Maybe we ought to be trying to compute this: > > big_jsonb_col=detoast(big_jsonb_col) > big_jsonb_col->'a', big_jsonb_col->'b', big_jsonb_col->'c' > > Or perhaps this: > > tmp=detoast(big_jsonb_col) > tmp->'a', tmp->'b', tmp->'c' Current the design a doesn't use 'tmp' so that the existing ExprEval engine can find out "detoast datum" easily, so it is more like: > big_jsonb_col=detoast(big_jsonb_col) > big_jsonb_col->'a', big_jsonb_col->'b', big_jsonb_col->'c' [1] https://www.postgresql.org/message-id/87ttllbd00.fsf%40163.com -- Best Regards Andy Fan
Andy Fan <zhihuifan1213@163.com> writes: > Hi Robert, > >> Andy Fan asked me off-list for some feedback about this proposal. I >> have hesitated to comment on it for lack of having studied the matter >> in any detail, but since I've been asked for my input, here goes: > > Thanks for doing this! Since we have two totally different designs > between Tomas and me and I think we both understand the two sides of > each design, just that the following things are really puzzled to me. my emacs was stuck and somehow this incompleted message was sent out, Please ignore this post for now. Sorry for this. -- Best Regards Andy Fan
Hi,
Robert, thank you very much for your response to this thread.
I agree with most of the things you've mentioned, but this proposal
is a PoC, and anyway it has a long way to go before being presented
(if it ever is) as something to be committed.
Andy, glad you've not lost interest in this work, I'm looking
forward to your improvements!
On Tue, May 21, 2024 at 10:02 PM Andy Fan <zhihuifan1213@163.com> wrote: > One more things I want to highlight it "syscache" is used for metadata > and *detoast cache* is used for user data. user data is more > likely bigger than metadata, so cache size control (hence eviction logic > run more often) is more necessary in this case. The first graph in [1] > (including the quota text) highlight this idea. Yes. > I think that is same idea as design a. Sounds similar. > I think the key points includes: > > 1. Where to put the *detoast value* so that ExprEval system can find out > it cheaply. The soluation in A slot->tts_values[*]. > > 2. How to manage the memory for holding the detoast value. > > The soluation in A is: > > - using a lazy created MemoryContext in slot. > - let the detoast system detoast the datum into the designed > MemoryContext. > > 3. How to let Expr evalution run the extra cycles when needed. I don't understand (3) but I agree that (1) and (2) are key points. In many - nearly all - circumstances just overwriting slot->tts_values is unsafe and will cause problems. To make this work, we've got to find some way of doing this that is compatible with general design of TupleTableSlot, and I think in that API, just about the only time you can touch slot->tts_values is when you're trying to populate a virtual slot. But often we'll have some other kind of slot. And even if we do have a virtual slot, I think you're only allowed to manipulate slot->tts_values up until you call ExecStoreVirtualTuple. That's why I mentioned using something temporary. You could legally (a) materialize the existing slot, (b) create a new virtual slot, (c) copy over the tts_values and tts_isnull arrays, (c) detoast any datums you like from tts_values and store the new datums back into the array, (d) call ExecStoreVirtualTuple, and then (e) use the new slot in place of the original one. That would avoid repeated detoasting, but it would also have a bunch of overhead. Having to fully materialize the existing slot is a cost that you wouldn't always need to pay if you didn't do this, but you have to do it to make this work. Creating the new virtual slot costs something. Copying the arrays costs something. Detoasting costs quite a lot potentially, if you don't end up using the values. If you end up needing a heap tuple representation of the slot, you're going to now have to make a new one rather than just using the one that came straight from the buffer. So I don't think we could do something like this unconditionally. We'd have to do it only when there was a reasonable expectation that it would work out, which means we'd have to be able to predict fairly accurately whether it was going to work out. But I do agree with you that *if* we could make it work, it would probably be better than a longer-lived cache. -- Robert Haas EDB: http://www.enterprisedb.com
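To make the (a)-(e) sequence above concrete, here is a deliberately naive
sketch using existing executor APIs - it always copies and always checks
every attribute, and it leaves open exactly the lifespan and overhead
questions raised above:

static TupleTableSlot *
make_detoasted_copy(TupleTableSlot *srcslot, TupleDesc tupdesc)
{
    /* (b) a separate virtual slot with the same descriptor */
    TupleTableSlot *dstslot = MakeSingleTupleTableSlot(tupdesc, &TTSOpsVirtual);
    int         natts = tupdesc->natts;

    /* (a) make the source slot own its data, then deform all attributes */
    ExecMaterializeSlot(srcslot);
    slot_getallattrs(srcslot);

    /* (c) copy the tts_values / tts_isnull arrays ... */
    memcpy(dstslot->tts_values, srcslot->tts_values, natts * sizeof(Datum));
    memcpy(dstslot->tts_isnull, srcslot->tts_isnull, natts * sizeof(bool));

    /* ... and replace any toasted varlena datums with detoasted copies */
    for (int attno = 0; attno < natts; attno++)
    {
        Form_pg_attribute att = TupleDescAttr(tupdesc, attno);

        if (!dstslot->tts_isnull[attno] && att->attlen == -1 &&
            VARATT_IS_EXTENDED(DatumGetPointer(dstslot->tts_values[attno])))
            dstslot->tts_values[attno] =
                PointerGetDatum(detoast_attr((struct varlena *)
                    DatumGetPointer(dstslot->tts_values[attno])));
    }

    /* (d) mark the virtual slot as holding a valid tuple */
    ExecStoreVirtualTuple(dstslot);

    /* (e) the caller uses dstslot instead of srcslot from here on */
    return dstslot;
}

The detoasted copies are palloc'd in the current memory context, so the
costs and the memory lifespan issues listed above apply exactly as described.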
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, May 21, 2024 at 10:02 PM Andy Fan <zhihuifan1213@163.com> wrote: >> One more things I want to highlight it "syscache" is used for metadata >> and *detoast cache* is used for user data. user data is more >> likely bigger than metadata, so cache size control (hence eviction logic >> run more often) is more necessary in this case. The first graph in [1] >> (including the quota text) highlight this idea. > > Yes. > >> I think that is same idea as design a. > > Sounds similar. > This is really great to know! >> I think the key points includes: >> >> 1. Where to put the *detoast value* so that ExprEval system can find out >> it cheaply. The soluation in A slot->tts_values[*]. >> >> 2. How to manage the memory for holding the detoast value. >> >> The soluation in A is: >> >> - using a lazy created MemoryContext in slot. >> - let the detoast system detoast the datum into the designed >> MemoryContext. >> >> 3. How to let Expr evalution run the extra cycles when needed. > > I don't understand (3) but I agree that (1) and (2) are key points. In > many - nearly all - circumstances just overwriting slot->tts_values is > unsafe and will cause problems. To make this work, we've got to find > some way of doing this that is compatible with general design of > TupleTableSlot, and I think in that API, just about the only time you > can touch slot->tts_values is when you're trying to populate a virtual > slot. But often we'll have some other kind of slot. And even if we do > have a virtual slot, I think you're only allowed to manipulate > slot->tts_values up until you call ExecStoreVirtualTuple. Please give me one more chance to explain this. What I mean is: Take SELECT f(a) FROM t1 join t2...; for example, When we read the Datum of a Var, we read it from tts->tts_values[*], no matter what kind of TupleTableSlot is. So if we can put the "detoast datum" back to the "original " tts_values, then nothing need to be changed. Aside from efficiency considerations, security and rationality are also important considerations. So what I think now is: - tts_values[*] means the Datum from the slot, and it has its operations like slot_getsomeattrs, ExecMaterializeSlot, ExecCopySlot, ExecClearTuple and so on. All of the characters have nothing with what kind of slot it is. - Saving the "detoast datum" version into tts_values[*] doesn't break the above semantic and the ExprEval engine just get a detoast version so it doesn't need to detoast it again. - The keypoint is the memory management and effeiciency. for now I think all the places where "slot->tts_nvalid" is set to 0 means the tts_values[*] is no longer validate, then this is the place we should release the memory for the "detoast datum". All the other places like ExecMaterializeSlot or ExecCopySlot also need to think about the "detoast datum" as well. > That's why I mentioned using something temporary. You could legally > (a) materialize the existing slot, (b) create a new virtual slot, (c) > copy over the tts_values and tts_isnull arrays, (c) detoast any datums > you like from tts_values and store the new datums back into the array, > (d) call ExecStoreVirtualTuple, and then (e) use the new slot in place > of the original one. That would avoid repeated detoasting, but it > would also have a bunch of overhead. Yes, I agree with the overhead, so I perfer to know if the my current strategy has principle issue first. 
> Having to fully materialize the
> existing slot is a cost that you wouldn't always need to pay if you
> didn't do this, but you have to do it to make this work. Creating the
> new virtual slot costs something. Copying the arrays costs something.
> Detoasting costs quite a lot potentially, if you don't end up using
> the values. If you end up needing a heap tuple representation of the
> slot, you're going to now have to make a new one rather than just
> using the one that came straight from the buffer.

> But I do agree with you
> that *if* we could make it work, it would probably be better than a
> longer-lived cache.

I noticed the "if" and it is great to know that:), especially for the
efficiency & memory usage control purpose.

Another big challenge of this is how to decide if a Var needs this
pre-detoasting strategy; we have to consider:

- The in-place tts_values[*] update makes the slot bigger, which is harmful
  for the CP_SMALL_TLIST (createplan.c) purpose.
- ExprEval run-time checking of whether the Var is toastable is an overhead;
  it must be decided at ExecInitExpr time.

The code complexity caused by the above 2 factors makes people (including
me) unhappy.

===

Yesterday I did some worst case testing for the current strategy, and the
first case showed me a ~1% *unexpected* performance regression. I tried to
find it with perf record/report, with no luck. That's the reason why I
didn't send a complete post yesterday. For now, since we are talking about
the high level design, I think it is OK to post the improved design
document, the incomplete worst case testing data and my planning.

Current Design
--------------

The high level design is to let createplan.c and setref.c decide which Vars
can use this feature, and then let the executor save the detoast datum back
to slot->tts_values[*] during the ExprEvalStep of
EEOP_{INNER|OUTER|SCAN}_VAR_TOAST. The reasons include:

- The existing expression engine reads datums from tts_values[*], so no
  extra work needs to be done to find them.
- Reusing the lifespan of the TupleTableSlot system to manage memory. It is
  natural to think of the detoast datum as a tts_value, just in a detoasted
  format. Since we have a clear lifespan for a TupleTableSlot already (like
  ExecClearTuple, ExecCopySlot etc.), it is easy to reuse them for managing
  the detoast datum's memory.
- The existing projection method copies the detoasted datum (int64)
  automatically to the next node's slot, but keeps the ownership unchanged,
  so only the slot where the detoast really happens takes charge of its
  lifespan.

Assuming which Var should use this feature has been decided in createplan.c
and setref.c already, the following strategy is used to reduce the run time
overhead.

1. Putting the detoast datum into slot->tts_values[*], so the ExprEval
system doesn't need any extra effort to find it:

static inline void
ExecSlotDetoastDatum(TupleTableSlot *slot, int attnum)
{
    if (!slot->tts_isnull[attnum] &&
        VARATT_IS_EXTENDED(slot->tts_values[attnum]))
    {
        if (unlikely(slot->tts_data_mctx == NULL))
            slot->tts_data_mctx = GenerationContextCreate(slot->tts_mcxt,
                                                          "tts_value_ctx",
                                                          ALLOCSET_START_SMALL_SIZES);
        slot->tts_values[attnum] =
            PointerGetDatum(detoast_attr_ext((struct varlena *) slot->tts_values[attnum],
                                             /* save the detoast value to the given MemoryContext. */
                                             slot->tts_data_mctx));
    }
}

2. Using a dedicated, lazily created memory context under the TupleTableSlot
so that the memory can be released effectively:

static inline void
ExecFreePreDetoastDatum(TupleTableSlot *slot)
{
    if (slot->tts_data_mctx)
        MemoryContextResetOnly(slot->tts_data_mctx);
}

3. Controlling the memory usage by releasing it whenever the
slot->tts_values[*] is deemed invalid, so the lifespan is likely short.

4. Reducing the run-time overhead of checking whether a Var should use the
pre-detoast feature: 3 new steps are introduced,
EEOP_{INNER,OUTER,SCAN}_VAR_TOAST, so that the other Vars are totally
unrelated (a sketch of such a step is shown after this message).

Now comes the createplan.c/setref.c part, which decides which Vars should
use the shared detoast feature. The guideline for this is:

1. It needs a detoast for a given expression in the previous logic.

2. It should not break the CP_SMALL_TLIST design. Since we saved the detoast
datum back to tts_values[*], the tuple becomes bigger; if we do this
blindly, it would be harmful to the ORDER / HASH style nodes.

A high level data flow is:

1. at createplan.c, we walk the plan tree to gather the CP_SMALL_TLIST
information caused by SORT/HASH style nodes, and save it to
Plan.forbid_pre_detoast_vars via the function
set_plan_forbid_pre_detoast_vars_recurse.

2. at setrefs.c, fix_{scan|join}_expr will recurse to the Vars of each
expression, so it is a good time to track the attribute numbers and see if
a Var is directly or indirectly accessed. Usually an indirect access of a
Var means a detoast would happen, for example in an expression like a > 3.
However some known expressions are ignored, for example: NullTest and
pg_column_compression, which need the raw datum, and start_with/sub_string,
which need a slice detoasting. Currently there is some hard-coding here; we
may need a pg_proc.detoasting_requirement flag to make this generic. The
output is {Scan|Join}.xxx_reference_attrs. Note that here I used
'_reference_' rather than '_detoast_' because at this point I still don't
know if it is a toastable attribute, which is known at the
MakeTupleTableSlot stage.

3. At the InitPlan stage, we calculate the final xxx_pre_detoast_attrs in
ScanState & JoinState, which are passed into the expression engine at the
ExecInitExprRec stage, and the EEOP_{INNER|OUTER|SCAN}_VAR_TOAST steps are
generated; finally everything is connected with ExecSlotDetoastDatum!

Bad case testing of current design:
====================================

- The extra effort of createplan.c & setref.c & execExpr.c & InitPlan:

CREATE TABLE t(a int, b numeric);

q1: explain select * from t where a > 3;

q2: set enable_hashjoin to off; set enable_mergejoin to off;
    set enable_nestloop to on;
    explain select * from t join t t2 using(a);

q3: set enable_hashjoin to off; set enable_mergejoin to on;
    set enable_nestloop to off;
    explain select * from t join t t2 using(a);

q4: set enable_hashjoin to on; set enable_mergejoin to off;
    set enable_nestloop to off;
    explain select * from t join t t2 using(a);

pgbench -f x.sql -n postgres -M simple -T10

| query ID | patched (ms)                        | master (ms)                         | regression(%) |
|----------+-------------------------------------+-------------------------------------+---------------|
|        1 | {0.088, 0.088, 0.088, 0.087, 0.089} | {0.087, 0.086, 0.086, 0.087, 0.087} | 1.14%         |
|        2 | not-done-yet                        |                                     |               |
|        3 | not-done-yet                        |                                     |               |
|        4 | not-done-yet                        |                                     |               |

- The extra effort of ExecInterpExpr:

insert into t select i, i FROM generate_series(1, 100000) i;
SELECT count(b) from t WHERE b > 0;

In this query, the extra execution code is run but it doesn't buy anything.

| master | patched | regression |
|--------+---------+------------|
|        |         |            |

What I am planning now is:

- Gather more feedback on the overall strategy.
- Figure out the 1% unexpected regression. I don't have a clear direction
  for now; my current approach is analyzing the perf report, and I can't
  find the root cause, and reading the code doesn't reveal the root cause
  either.
- Simplify the "planner part" for deciding which Var should use this
  strategy. I don't have a clear direction here either.
- Test and improve more worst cases.

Attached is an applicable and testable version against current master
(e87e73245).

--
Best Regards
Andy Fan
Attachment
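To make step 4 of the design above concrete, here is a sketch of what such a
specialized step could look like inside ExecInterpExpr(), modeled on the
existing EEOP_SCAN_VAR case; treat it as an illustration rather than the
patch itself:

EEO_CASE(EEOP_SCAN_VAR_TOAST)
{
    int         attnum = op->d.var.attnum;

    Assert(attnum >= 0 && attnum < scanslot->tts_nvalid);

    /* detoast in place, so any later reference sees the detoasted datum */
    ExecSlotDetoastDatum(scanslot, attnum);

    *op->resvalue = scanslot->tts_values[attnum];
    *op->resnull = scanslot->tts_isnull[attnum];

    EEO_NEXT();
}

Vars that were not selected by the planner keep using the plain
EEOP_SCAN_VAR step, so they pay no extra cost.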
Nikita Malakhov <hukutoc@gmail.com> writes:

> Hi,
> Andy, glad you've not lost interest in this work, I'm looking
> forward to your improvements!

Thanks for your words, I've adjusted to the rhythm of the community and
welcome more feedback:)

--
Best Regards
Andy Fan
On Wed, May 22, 2024 at 9:46 PM Andy Fan <zhihuifan1213@163.com> wrote: > Please give me one more chance to explain this. What I mean is: > > Take SELECT f(a) FROM t1 join t2...; for example, > > When we read the Datum of a Var, we read it from tts->tts_values[*], no > matter what kind of TupleTableSlot is. So if we can put the "detoast > datum" back to the "original " tts_values, then nothing need to be > changed. Yeah, I don't think you can do that. > - Saving the "detoast datum" version into tts_values[*] doesn't break > the above semantic and the ExprEval engine just get a detoast version > so it doesn't need to detoast it again. I don't think this is true. If it is true, it needs to be extremely well-justified. Even if this seems to work in simple cases, I suspect there will be cases where it breaks badly, resulting in memory leaks or server crashes or some other kind of horrible problem that I can't quite imagine. Unfortunately, my knowledge of this code isn't fantastic, so I can't say exactly what bad thing will happen, and I can't even say with 100% certainty that anything bad will happen, but I bet it will. It seems like it goes against everything I understand about how TupleTableSlots are supposed to be used. (Andres would be the best person to give a definitive answer here.) > - The keypoint is the memory management and effeiciency. for now I think > all the places where "slot->tts_nvalid" is set to 0 means the > tts_values[*] is no longer validate, then this is the place we should > release the memory for the "detoast datum". All the other places like > ExecMaterializeSlot or ExecCopySlot also need to think about the > "detoast datum" as well. This might be a way of addressing some of the issues with this idea, but I doubt it will be acceptable. I don't think we want to complicate the slot API for this feature. There's too much downside to doing that, in terms of performance and understandability. -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > On Wed, May 22, 2024 at 9:46 PM Andy Fan <zhihuifan1213@163.com> wrote: >> Please give me one more chance to explain this. What I mean is: >> >> Take SELECT f(a) FROM t1 join t2...; for example, >> >> When we read the Datum of a Var, we read it from tts->tts_values[*], no >> matter what kind of TupleTableSlot is. So if we can put the "detoast >> datum" back to the "original " tts_values, then nothing need to be >> changed. > > Yeah, I don't think you can do that. > >> - Saving the "detoast datum" version into tts_values[*] doesn't break >> the above semantic and the ExprEval engine just get a detoast version >> so it doesn't need to detoast it again. > > I don't think this is true. If it is true, it needs to be extremely > well-justified. Even if this seems to work in simple cases, I suspect > there will be cases where it breaks badly, resulting in memory leaks > or server crashes or some other kind of horrible problem that I can't > quite imagine. Unfortunately, my knowledge of this code isn't > fantastic, so I can't say exactly what bad thing will happen, and I > can't even say with 100% certainty that anything bad will happen, but > I bet it will. It seems like it goes against everything I understand > about how TupleTableSlots are supposed to be used. (Andres would be > the best person to give a definitive answer here.) > >> - The keypoint is the memory management and effeiciency. for now I think >> all the places where "slot->tts_nvalid" is set to 0 means the >> tts_values[*] is no longer validate, then this is the place we should >> release the memory for the "detoast datum". All the other places like >> ExecMaterializeSlot or ExecCopySlot also need to think about the >> "detoast datum" as well. > > This might be a way of addressing some of the issues with this idea, > but I doubt it will be acceptable. I don't think we want to complicate > the slot API for this feature. There's too much downside to doing > that, in terms of performance and understandability. Thanks for the doubt! Let's see if Andres has time to look at this. I feel great about the current state. -- Best Regards Andy Fan
Hi,

Let me clarify the current state of this patch and what kind of help is
needed.

You can check [1] for the problem it tries to solve and the current
approach. The current blocker is whether the approach is safe (no
potential for memory leaks, crashes, etc.), and I want to reach some
agreement on at least that point.

"""
When we access [some desired] toast column in EEOP_{SCAN|INNER|OUTER}_VAR
steps, we can detoast it immediately and save it back to
slot->tts_values, so the later ExprStates can use the same detoast
version all the time. This should be the most effective and simplest
way.
"""

IMO it doesn't break any part of the TupleTableSlot design and should be
safe. Here is my understanding of TupleTableSlot; please correct me if I
missed anything.

(a). TupleTableSlot defines tts_values[*] / tts_isnull[*] at the high
     level. When the upper layer wants an attnum, it just accesses
     tts_values/tts_isnull[n].

(b). TupleTableSlot has many different variants, which differ in how
     tts_values[*] is filled from them, like VirtualTupleTableSlot and
     HeapTupleTableSlot, and in how their resources (mainly memory) are
     managed when the life cycle is done, for example
     BufferHeapTupleTableSlot.

(c). TupleTableSlot defines a lot of operations on top of that, like
     getsomeattrs, copy, clear and so on, for the different variants.

The impacts of my proposal on the above factors:

(a). It clearly doesn't break this.

(b). It adds a 'detoast' step when filling tts_values[*] (this is done
     conditionally; see [1] for the overall design, here we focus on
     safety). What's more, I only added the step in the
     EEOP_{SCAN|INNER|OUTER}_VAR_TOAST steps, which limits the impact on
     the other steps.

(c). A special MemoryContext in TupleTableSlot is introduced for the
     detoast datums; it is reset whenever TupleTableSlot.tts_nvalid is
     set back to 0 (a sketch follows at the end of this mail). I also
     double-checked every operation defined in TupleTableSlotOps, and it
     looks like nothing needs special handling.

Robert has expressed his concern as below and wanted Andres to have a
look, but unluckily Andres hasn't had time for it so far :(

> Robert Haas <robertmhaas@gmail.com> writes:
>
>> On Wed, May 22, 2024 at 9:46 PM Andy Fan <zhihuifan1213@163.com> wrote:
>>> Please give me one more chance to explain this. What I mean is:
>>>
>>> Take SELECT f(a) FROM t1 join t2...; for example,
>>>
>>> When we read the Datum of a Var, we read it from tts->tts_values[*], no
>>> matter what kind of TupleTableSlot it is. So if we can put the "detoast
>>> datum" back into the "original" tts_values, then nothing needs to be
>>> changed.
>>
>> Yeah, I don't think you can do that.
>>
>>> - Saving the "detoast datum" version into tts_values[*] doesn't break
>>>   the above semantics, and the ExprEval engine just gets a detoast
>>>   version, so it doesn't need to detoast it again.
>>
>> I don't think this is true. If it is true, it needs to be extremely
>> well-justified. Even if this seems to work in simple cases, I suspect
>> there will be cases where it breaks badly, resulting in memory leaks or
>> server crashes or some other kind of horrible problem that I can't quite
>> imagine. Unfortunately, my knowledge of this code isn't fantastic, so I
>> can't say exactly what bad thing will happen, and I can't even say with
>> 100% certainty that anything bad will happen, but I bet it will. It
>> seems like it goes against everything I understand about how
>> TupleTableSlots are supposed to be used. (Andres would be the best
>> person to give a definitive answer here.)
I think a discussion of the design of TupleTableSlots would help this
make progress, since to my knowledge the proposal doesn't go against
anything in the TupleTableSlot design, as I stated above. I'm pretty
open on this.

>>
>>> - The key point is the memory management and efficiency. For now I
>>>   think all the places where "slot->tts_nvalid" is set to 0 mean the
>>>   tts_values[*] are no longer valid, so those are the places where we
>>>   should release the memory for the "detoast datum". All the other
>>>   places like ExecMaterializeSlot or ExecCopySlot also need to consider
>>>   the "detoast datum" as well.
>>
>> This might be a way of addressing some of the issues with this idea, but
>> I doubt it will be acceptable. I don't think we want to complicate the
>> slot API for this feature. There's too much downside to doing that, in
>> terms of performance and understandability.

Complexity: I double-checked the code; it makes very limited changes to
the TupleTableSlot APIs (fewer than 10 lines), see tuptable.h and the
sketch after this mail.

Performance: by design it only moves the detoast action earlier,
*conditionally*, and puts the result back into tts_values[*]; there is
no extra overhead to look up the detoast version when sharing it.

Understandability: it depends on how we understand TupleTableSlot :)

[1] https://www.postgresql.org/message-id/87il4jrk1l.fsf%40163.com

--
Best Regards
Andy Fan
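To show what the "fewer than 10 lines in tuptable.h" and the
reset-on-tts_nvalid rule amount to, here is a sketch under the design
described above. The field name tts_data_mcxt and the helper name
ExecFreePreDetoastDatum are illustrative rather than necessarily what
the patch uses; pre_detoasted_attrs is the bitmapset mentioned earlier
in the thread.

/* New members added to struct TupleTableSlot (names illustrative): */

    /* dedicated memory holding pre-detoasted datums in tts_values[*] */
    MemoryContext   tts_data_mcxt;

    /* zero-based attnums whose tts_values[*] entry is a detoasted copy */
    Bitmapset      *pre_detoasted_attrs;

/*
 * Called from every place that sets tts_nvalid back to 0
 * (ExecClearTuple, storing a new tuple into the slot, ...).  Once
 * tts_values[*] is invalidated nothing can legally reference the
 * detoasted copies any more, so they are freed in one shot by resetting
 * the dedicated context.  This sketch assumes the bitmapset itself is
 * allocated in the slot's long-lived tts_mcxt, not in tts_data_mcxt.
 */
static inline void
ExecFreePreDetoastDatum(TupleTableSlot *slot)
{
    if (!bms_is_empty(slot->pre_detoasted_attrs))
    {
        MemoryContextReset(slot->tts_data_mcxt);
        bms_free(slot->pre_detoasted_attrs);
        slot->pre_detoasted_attrs = NULL;
    }
}

As the quoted mail notes, ExecMaterializeSlot and ExecCopySlot are the
other places that need care, since they copy or take ownership of
whatever currently sits in tts_values[*].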