Thread: Shared detoast Datum proposal

Shared detoast Datum proposal

From

Andy Fan

Date:

27 December 2023, 10:45:25

Problem:
--------
Toast works well for its claimed purposes. However the current detoast
infrastructure doesn't shared the detoast datum in the query's
lifespan, which will cause some inefficiency like this:

SELECT big_jsonb_col->'a',  big_jsonb_col->'b',  big_jsonb_col->'c'
FROM t;

In the above case, the big_jsonb_col will be detoast 3 times for every
record.

a more common case maybe:

CREATE TABLE t(a numeric);
insert into t select i FROM generate_series(1, 10)i;
SELECT * FROM t WHERE a > 3;

Here we needs detoasted because of VARATT_CAN_MAKE_SHORT, and it needs to
be detoasted twice, one is in a > 3,  the other one is in the targetlist,
where we need to call numeric_out. 

Proposal
--------

When we access some desired toast column in EEOP_{SCAN|INNER_OUTER}_VAR
steps, we can detoast it immediately and save it back to
slot->tts_values.  With this way, when any other expressions seeks for
the Datum, it get the detoast version. Planner decides which Vars
should use this feature, executor manages it detoast action and memory
management. 

Planner design
---------------

1. This feature only happen at the Plan Node where the detoast would
   happen in the previous topology, for example:

   for example:
   
   SELECT t1.toast_col, t2.toast_col FROM t1 join t2 USING (toast_col);

   toast_col just happens at the Join node's slot even if we have a
   projection on t1 or t2 at the scan node (except the Parameterized path).

   However if

   SELECT t1.toast_col, t2.toast_col
   FROM t1 join t2
   USING (toast_col)
   WHERE t1.toast_col > 'a';

   the detoast *may* happen at the scan of level t1 since "t1.toast_col >
   'a'" accesses the Var within a FuncCall ('>' operator), which will
   cause a detoast. (However it should not happen if it is under a Sort
   node, for details see Planner Design section 2).

   At the implementation side, I added "Bitmapset  *reference_attrs;" to
   Scan node which show if the Var should be accessed with the
   pre-detoast way in expression execution engine. the value is
   maintained at the create_plan/set_plan_refs stage. 

   Two similar fields are added in Join node.

       Bitmapset  *outer_reference_attrs;
    Bitmapset  *inner_reference_attrs;

   In the both case, I tracked the level of walker/mutator, if the level
   greater than 1 when we access a Var, the 'var->varattno - 1' is added
   to the bitmapset. Some special node should be ignored, see
   increase_level_for_pre_detoast for details.

2. We also need to make sure the detoast datum will not increase the
   work_mem usage for the nodes like Sort/Hash etc, all of such nodes
   can be found with search 'CP_SMALL_TLIST' flags. 

   If the a node under a Sort-Hash-like nodes, we have some extra
   checking to see if a Var is a *directly input* of such nodes. If yes,
   we can't detoast it in advance, or else, we know the Var has been
   discarded before goes to these nodes, we still can use the shared
   detoast feature.

   The simplest cases to show this is:

   For example:

   2.1
   Query-1
   explain (verbose) select * from t1 where b > 'a';
   -- b can be detoast in advance.

   Query-2
   explain (verbose) select * from t1 where b > 'a' order by c;
   -- b can't be detoast since it will makes the Sort use more work_mem.

   Query-3
   explain (verbose) select a, c from t1 where b > 'a' order by c;
   -- b can be pre-detoasted, since it is discarded before goes to Sort
   node. In this case it doesn't do anything good, but for some complex
   case like Query-4, it does. 

Query-4   
explain (costs off, verbose)
select t3.*
from t1, t2, t3
where t2.c > '999999999999999'
and t2.c = t1.c
and t3.b = t1.b;

                           QUERY PLAN                          
--------------------------------------------------------------
 Hash Join
   Output: t3.a, t3.b, t3.c
   Hash Cond: (t3.b = t1.b)
   ->  Seq Scan on public.t3
         Output: t3.a, t3.b, t3.c
   ->  Hash
         Output: t1.b
         ->  Nested Loop
               Output: t1.b  <-- Note here 3
               ->  Seq Scan on public.t2
                     Output: t2.a, t2.b, t2.c
                     Filter: (t2.c > '9999...'::text) <--- Note Here 1
               ->  Index Scan using t1_c_idx on public.t1
                     Output: t1.a, t1.b, t1.c
                     Index Cond: (t1.c = t2.c)  <--- Note Here 2 
(15 rows) 
 
In this case, detoast datum for t2.c can be shared and it benefits for
t2.c = t1.c and no harm for the Hash node.


Execution side
--------------

Once we decide a Var should be pre-detoasted for a given plan node, a
special step named as EEOP_{INNER/OUTER/SCAN}_VAR_TOAST will be created
during ExecInitExpr stage. The special steps are introduced to avoid its
impacts on the non-related EEOP_{INNER/OUTER/SCAN}_VAR code path.

slot->tts_values is used to store the detoast datum so that any other
expressions can access it pretty easily.

Because of the above design, the detoast datum should have a same
lifespan as any other slot->tts_values[*], so the default
ecxt_per_tuple_memory is not OK for this. At last I used slot->tts_mcxt
to hold the memory, and maintaining these lifecycles in execTuples.c. To
know which datum in slot->tts_values is pre-detoasted, Bitmapset *
slot->detoast_attrs is introduced.

During the execution of these steps, below code like this is used:

static inline void
ExecSlotDetoastDatum(TupleTableSlot *slot, int attnum)
{
    if (!slot->tts_isnull[attnum] && VARATT_IS_EXTENDED(slot->tts_values[attnum]))
    {
        Datum        oldDatum;
        MemoryContext old = MemoryContextSwitchTo(slot->tts_mcxt);

        oldDatum = slot->tts_values[attnum];
        slot->tts_values[attnum] = PointerGetDatum(detoast_attr(
                                                                (struct varlena *) oldDatum));
        Assert(slot->tts_nvalid > attnum);
        if (oldDatum != slot->tts_values[attnum])
            slot->pre_detoasted_attrs = bms_add_member(slot->pre_detoasted_attrs, attnum);
        MemoryContextSwitchTo(old);
    }
}


Testing
-------
- shared_detoast_slow.sql is used to test the planner related codes changes.

  'set jit to off' will enable more INFO logs about which Var is
  pre-detoast in which node level.

- the cases the patch doesn't help much.

create table w(a int, b numeric);
insert into w select i, i from generate_series(1, 1000000)i;

Q3:
select a from w where a > 0;

Q4:
select b from w where b > 0;

pgbench -n -f 1.sql postgres -T 10 -M prepared

run 5 times and calculated the average value.

| Qry No |  Master | patched |  perf | comment                                 |
|--------+---------+---------+-------+-----------------------------------------|
|      3 | 309.438 | 308.411 |     0 | nearly zero impact on them              |
|      4 | 431.735 | 420.833 | +2.6% | save the detoast effort for numeric_out |

- good case

  setup:

create table b(big jsonb);
insert into b select
jsonb_object_agg( x::text,
random()::text || random()::text || random()::text )
from generate_series(1,600) f(x);
insert into b select (select big from b) from generate_series(1, 1000)i;


  workload:
  Q1:
  select big->'1', big->'2', big->'3', big->'5', big->'10' from b;

  Q2:
  select 1 from b where length(big->>'1') > 0 and length(big->>'2') > 2;


| No | Master | patched |
|----+--------+---------|
|  1 |   1677 |     357 |
|  2 |    638 |     332 |


Some Known issues:
------------------

1. Currently only Scan & Join nodes are considered for this feature.
2. JIT is not adapted for this purpose yet.
3. This feature builds a strong relationship between slot->tts_values
   and slot->pre_detoast_attrs. for example slot->tts_values[1] holds a
   detoast datum, so 1 is in the slot->pre_detoast_attrs bitmapset,
   however if someone changes the value directly, like
   "slot->tts_values[1] = 2;" but without touching the slot's
   pre_detoast_attrs then troubles comes. The good thing is the detoast
   can only happen on Scan/Join nodes. so any other slot is impossible
   to have such issue. I run the following command to find out such
   cases, looks none of them is a slot form Scan/Join node.

   egrep -nri 'slot->tts_values\[[^\]]*\] = *

Any thought?

-- 
Best Regards
Andy Fan

Attachment

v1-0001-shared-detoast-feature.patch

Re: Shared detoast Datum proposal

From

Andy Fan

Date:

01 January 2024, 13:55:02

Andy Fan <zhihuifan1213@163.com> writes:

>
> Some Known issues:
> ------------------
>
> 1. Currently only Scan & Join nodes are considered for this feature.
> 2. JIT is not adapted for this purpose yet.

JIT is adapted for this feature in v2. Any feedback is welcome.

-- 
Best Regards
Andy Fan

Attachment

v2-0001-shared-detoast-feature.patch

Re: Shared detoast Datum proposal

From

vignesh C

Date:

06 January 2024, 15:12:42

On Mon, 1 Jan 2024 at 19:26, Andy Fan <zhihuifan1213@163.com> wrote:
>
>
> Andy Fan <zhihuifan1213@163.com> writes:
>
> >
> > Some Known issues:
> > ------------------
> >
> > 1. Currently only Scan & Join nodes are considered for this feature.
> > 2. JIT is not adapted for this purpose yet.
>
> JIT is adapted for this feature in v2. Any feedback is welcome.

One of the tests was aborted at CFBOT [1] with:
[09:47:00.735] dumping /tmp/cores/postgres-11-28182.core for
/tmp/cirrus-ci-build/build/tmp_install//usr/local/pgsql/bin/postgres
[09:47:01.035] [New LWP 28182]
[09:47:01.748] [Thread debugging using libthread_db enabled]
[09:47:01.748] Using host libthread_db library
"/lib/x86_64-linux-gnu/libthread_db.so.1".
[09:47:09.392] Core was generated by `postgres: postgres regression
[local] SELECT                                  '.
[09:47:09.392] Program terminated with signal SIGSEGV, Segmentation fault.
[09:47:09.392] #0  0x00007fa4eed4a5a1 in ?? ()
[09:47:11.123]
[09:47:11.123] Thread 1 (Thread 0x7fa4f8050a40 (LWP 28182)):
[09:47:11.123] #0  0x00007fa4eed4a5a1 in ?? ()
[09:47:11.123] No symbol table info available.
...
...
[09:47:11.123] #4  0x00007fa4ebc7a186 in LLVMOrcGetSymbolAddress () at
/build/llvm-toolchain-11-HMpQvg/llvm-toolchain-11-11.0.1/llvm/lib/ExecutionEngine/Orc/OrcCBindings.cpp:124
[09:47:11.123] No locals.
[09:47:11.123] #5  0x00007fa4eed6fc7a in llvm_get_function
(context=0x564b1813a8a0, funcname=0x7fa4eed4a570 "AWAVATSH\201",
<incomplete sequence \354\210>) at
../src/backend/jit/llvm/llvmjit.c:460
[09:47:11.123]         addr = 94880527996960
[09:47:11.123]         __func__ = "llvm_get_function"
[09:47:11.123] #6  0x00007fa4eed902e1 in ExecRunCompiledExpr
(state=0x0, econtext=0x564b18269d20, isNull=0x7ffc11054d5f) at
../src/backend/jit/llvm/llvmjit_expr.c:2577
[09:47:11.123]         cstate = <optimized out>
[09:47:11.123]         func = 0x564b18269d20
[09:47:11.123] #7  0x0000564b1698e614 in ExecEvalExprSwitchContext
(isNull=0x7ffc11054d5f, econtext=0x564b18269d20, state=0x564b182ad820)
at ../src/include/executor/executor.h:355
[09:47:11.123]         retDatum = <optimized out>
[09:47:11.123]         oldContext = 0x564b182680d0
[09:47:11.123]         retDatum = <optimized out>
[09:47:11.123]         oldContext = <optimized out>
[09:47:11.123] #8  ExecProject (projInfo=0x564b182ad818) at
../src/include/executor/executor.h:389
[09:47:11.123]         econtext = 0x564b18269d20
[09:47:11.123]         state = 0x564b182ad820
[09:47:11.123]         slot = 0x564b182ad788
[09:47:11.123]         isnull = false
[09:47:11.123]         econtext = <optimized out>
[09:47:11.123]         state = <optimized out>
[09:47:11.123]         slot = <optimized out>
[09:47:11.123]         isnull = <optimized out>
[09:47:11.123] #9  ExecMergeJoin (pstate=<optimized out>) at
../src/backend/executor/nodeMergejoin.c:836
[09:47:11.123]         node = <optimized out>
[09:47:11.123]         joinqual = 0x0
[09:47:11.123]         otherqual = 0x0
[09:47:11.123]         qualResult = <optimized out>
[09:47:11.123]         compareResult = <optimized out>
[09:47:11.123]         innerPlan = <optimized out>
[09:47:11.123]         innerTupleSlot = <optimized out>
[09:47:11.123]         outerPlan = <optimized out>
[09:47:11.123]         outerTupleSlot = <optimized out>
[09:47:11.123]         econtext = 0x564b18269d20
[09:47:11.123]         doFillOuter = false
[09:47:11.123]         doFillInner = false
[09:47:11.123]         __func__ = "ExecMergeJoin"
[09:47:11.123] #10 0x0000564b169275b9 in ExecProcNodeFirst
(node=0x564b18269db0) at ../src/backend/executor/execProcnode.c:464
[09:47:11.123] No locals.
[09:47:11.123] #11 0x0000564b169a2675 in ExecProcNode
(node=0x564b18269db0) at ../src/include/executor/executor.h:273
[09:47:11.123] No locals.
[09:47:11.123] #12 ExecRecursiveUnion (pstate=0x564b182684a0) at
../src/backend/executor/nodeRecursiveunion.c:115
[09:47:11.123]         node = 0x564b182684a0
[09:47:11.123]         outerPlan = 0x564b18268d00
[09:47:11.123]         innerPlan = 0x564b18269db0
[09:47:11.123]         plan = 0x564b182ddc78
[09:47:11.123]         slot = <optimized out>
[09:47:11.123]         isnew = false
[09:47:11.123] #13 0x0000564b1695a421 in ExecProcNode
(node=0x564b182684a0) at ../src/include/executor/executor.h:273
[09:47:11.123] No locals.
[09:47:11.123] #14 CteScanNext (node=0x564b183a6830) at
../src/backend/executor/nodeCtescan.c:103
[09:47:11.123]         cteslot = <optimized out>
[09:47:11.123]         estate = <optimized out>
[09:47:11.123]         dir = ForwardScanDirection
[09:47:11.123]         forward = true
[09:47:11.123]         tuplestorestate = 0x564b183a5cd0
[09:47:11.123]         eof_tuplestore = <optimized out>
[09:47:11.123]         slot = 0x564b183a6be0
[09:47:11.123] #15 0x0000564b1692e22b in ExecScanFetch
(node=node@entry=0x564b183a6830,
accessMtd=accessMtd@entry=0x564b1695a183 <CteScanNext>,
recheckMtd=recheckMtd@entry=0x564b16959db3 <CteScanRecheck>) at
../src/backend/executor/execScan.c:132
[09:47:11.123]         estate = <optimized out>
[09:47:11.123] #16 0x0000564b1692e332 in ExecScan
(node=0x564b183a6830, accessMtd=accessMtd@entry=0x564b1695a183
<CteScanNext>, recheckMtd=recheckMtd@entry=0x564b16959db3
<CteScanRecheck>) at ../src/backend/executor/execScan.c:181
[09:47:11.123]         econtext = 0x564b183a6b50
[09:47:11.123]         qual = 0x0
[09:47:11.123]         projInfo = 0x0
[09:47:11.123] #17 0x0000564b1695a5ea in ExecCteScan
(pstate=<optimized out>) at ../src/backend/executor/nodeCtescan.c:164
[09:47:11.123]         node = <optimized out>
[09:47:11.123] #18 0x0000564b169a81da in ExecProcNode
(node=0x564b183a6830) at ../src/include/executor/executor.h:273
[09:47:11.123] No locals.
[09:47:11.123] #19 ExecSort (pstate=0x564b183a6620) at
../src/backend/executor/nodeSort.c:149
[09:47:11.123]         plannode = 0x564b182dfb78
[09:47:11.123]         outerNode = 0x564b183a6830
[09:47:11.123]         tupDesc = <optimized out>
[09:47:11.123]         tuplesortopts = <optimized out>
[09:47:11.123]         node = 0x564b183a6620
[09:47:11.123]         estate = 0x564b182681d0
[09:47:11.123]         dir = ForwardScanDirection
[09:47:11.123]         tuplesortstate = 0x564b18207ff0
[09:47:11.123]         slot = <optimized out>
[09:47:11.123] #20 0x0000564b169275b9 in ExecProcNodeFirst
(node=0x564b183a6620) at ../src/backend/executor/execProcnode.c:464
[09:47:11.123] No locals.
[09:47:11.123] #21 0x0000564b16913d01 in ExecProcNode
(node=0x564b183a6620) at ../src/include/executor/executor.h:273
[09:47:11.123] No locals.
[09:47:11.123] #22 ExecutePlan (estate=estate@entry=0x564b182681d0,
planstate=0x564b183a6620, use_parallel_mode=<optimized out>,
operation=operation@entry=CMD_SELECT,
sendTuples=sendTuples@entry=true, numberTuples=numberTuples@entry=0,
direction=ForwardScanDirection, dest=0x564b182e2728,
execute_once=true) at ../src/backend/executor/execMain.c:1670
[09:47:11.123]         slot = <optimized out>
[09:47:11.123]         current_tuple_count = 0
[09:47:11.123] #23 0x0000564b16914024 in standard_ExecutorRun
(queryDesc=0x564b181ba200, direction=ForwardScanDirection, count=0,
execute_once=<optimized out>) at
../src/backend/executor/execMain.c:365
[09:47:11.123]         estate = 0x564b182681d0
[09:47:11.123]         operation = CMD_SELECT
[09:47:11.123]         dest = 0x564b182e2728
[09:47:11.123]         sendTuples = true
[09:47:11.123]         oldcontext = 0x564b181ba100
[09:47:11.123]         __func__ = "standard_ExecutorRun"
[09:47:11.123] #24 0x0000564b1691418f in ExecutorRun
(queryDesc=queryDesc@entry=0x564b181ba200,
direction=direction@entry=ForwardScanDirection, count=count@entry=0,
execute_once=<optimized out>) at
../src/backend/executor/execMain.c:309
[09:47:11.123] No locals.
[09:47:11.123] #25 0x0000564b16d208af in PortalRunSelect
(portal=portal@entry=0x564b1817ae10, forward=forward@entry=true,
count=0, count@entry=9223372036854775807,
dest=dest@entry=0x564b182e2728) at ../src/backend/tcop/pquery.c:924
[09:47:11.123]         queryDesc = 0x564b181ba200
[09:47:11.123]         direction = <optimized out>
[09:47:11.123]         nprocessed = <optimized out>
[09:47:11.123]         __func__ = "PortalRunSelect"
[09:47:11.123] #26 0x0000564b16d2405b in PortalRun
(portal=portal@entry=0x564b1817ae10,
count=count@entry=9223372036854775807,
isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true,
dest=dest@entry=0x564b182e2728, altdest=altdest@entry=0x564b182e2728,
qc=0x7ffc110551e0) at ../src/backend/tcop/pquery.c:768
[09:47:11.123]         _save_exception_stack = 0x7ffc11055290
[09:47:11.123]         _save_context_stack = 0x0
[09:47:11.123]         _local_sigjmp_buf = {{__jmpbuf = {1,
3431825231787999889, 94880528213800, 94880526221072, 94880526741008,
94880526221000, -3433879442176195951, -8991885832768699759},
__mask_was_saved = 0, __saved_mask = {__val = {140720594047343, 688,
94880526749216, 94880511205302, 1, 140720594047343, 94880526999808, 8,
94880526213088, 112, 179, 94880526221072, 94880526221048,
94880526221000, 94880508502828, 2}}}}
[09:47:11.123]         _do_rethrow = <optimized out>
[09:47:11.123]         result = <optimized out>
[09:47:11.123]         nprocessed = <optimized out>
[09:47:11.123]         saveTopTransactionResourceOwner = 0x564b18139ad8
[09:47:11.123]         saveTopTransactionContext = 0x564b181229f0
[09:47:11.123]         saveActivePortal = 0x0
[09:47:11.123]         saveResourceOwner = 0x564b18139ad8
[09:47:11.123]         savePortalContext = 0x0
[09:47:11.123]         saveMemoryContext = 0x564b181229f0
[09:47:11.123]         __func__ = "PortalRun"
[09:47:11.123] #27 0x0000564b16d1d098 in exec_simple_query
(query_string=query_string@entry=0x564b180fa0e0 "with recursive
search_graph(f, t, label) as (\n\tselect * from graph0 g\n\tunion
all\n\tselect g.*\n\tfrom graph0 g, search_graph sg\n\twhere g.f =
sg.t\n) search depth first by f, t set seq\nselect * from search"...)
at ../src/backend/tcop/postgres.c:1273
[09:47:11.123]         cmdtaglen = 6
[09:47:11.123]         snapshot_set = <optimized out>
[09:47:11.123]         per_parsetree_context = 0x0
[09:47:11.123]         plantree_list = 0x564b182e26d8
[09:47:11.123]         parsetree = 0x564b180fbec8
[09:47:11.123]         commandTag = <optimized out>
[09:47:11.123]         qc = {commandTag = CMDTAG_UNKNOWN, nprocessed = 0}
[09:47:11.123]         querytree_list = <optimized out>
[09:47:11.123]         portal = 0x564b1817ae10
[09:47:11.123]         receiver = 0x564b182e2728
[09:47:11.123]         format = 0
[09:47:11.123]         cmdtagname = <optimized out>
[09:47:11.123]         parsetree_item__state = {l = <optimized out>, i
= <optimized out>}
[09:47:11.123]         dest = DestRemote
[09:47:11.123]         oldcontext = 0x564b181229f0
[09:47:11.123]         parsetree_list = 0x564b180fbef8
[09:47:11.123]         parsetree_item = 0x564b180fbf10
[09:47:11.123]         save_log_statement_stats = false
[09:47:11.123]         was_logged = false
[09:47:11.123]         use_implicit_block = false
[09:47:11.123]         msec_str =
"\004\000\000\000\000\000\000\000\346[\376\026KV\000\000pR\005\021\374\177\000\000\335\000\000\000\000\000\000"
[09:47:11.123]         __func__ = "exec_simple_query"
[09:47:11.123] #28 0x0000564b16d1fe33 in PostgresMain
(dbname=<optimized out>, username=<optimized out>) at
../src/backend/tcop/postgres.c:4653
[09:47:11.123]         query_string = 0x564b180fa0e0 "with recursive
search_graph(f, t, label) as (\n\tselect * from graph0 g\n\tunion
all\n\tselect g.*\n\tfrom graph0 g, search_graph sg\n\twhere g.f =
sg.t\n) search depth first by f, t set seq\nselect * from search"...
[09:47:11.123]         firstchar = <optimized out>
[09:47:11.123]         input_message = {data = 0x564b180fa0e0 "with
recursive search_graph(f, t, label) as (\n\tselect * from graph0
g\n\tunion all\n\tselect g.*\n\tfrom graph0 g, search_graph
sg\n\twhere g.f = sg.t\n) search depth first by f, t set seq\nselect *
from search"..., len = 221, maxlen = 1024, cursor = 221}
[09:47:11.123]         local_sigjmp_buf = {{__jmpbuf =
{94880520543032, -8991887800862096751, 0, 4, 140720594048004, 1,
-3433879442247499119, -8991885813233991023}, __mask_was_saved = 1,
__saved_mask = {__val = {4194304, 1, 140346553036196, 94880526187920,
15616, 15680, 94880508418872, 0, 94880526187920, 15616,
94880520537224, 4, 140720594048004, 1, 94880508502235, 1}}}}
[09:47:11.123]         send_ready_for_query = false
[09:47:11.123]         idle_in_transaction_timeout_enabled = false
[09:47:11.123]         idle_session_timeout_enabled = false
[09:47:11.123]         __func__ = "PostgresMain"
[09:47:11.123] #29 0x0000564b16bcc4e4 in BackendRun
(port=port@entry=0x564b18126f50) at
../src/backend/postmaster/postmaster.c:4464

[1] - https://cirrus-ci.com/task/4765094966460416

Regards,
Vignesh

Re: Shared detoast Datum proposal

From

Andy Fan

Date:

07 January 2024, 01:10:13

Hi,

vignesh C <vignesh21@gmail.com> writes:

> On Mon, 1 Jan 2024 at 19:26, Andy Fan <zhihuifan1213@163.com> wrote:
>>
>>
>> Andy Fan <zhihuifan1213@163.com> writes:
>>
>> >
>> > Some Known issues:
>> > ------------------
>> >
>> > 1. Currently only Scan & Join nodes are considered for this feature.
>> > 2. JIT is not adapted for this purpose yet.
>>
>> JIT is adapted for this feature in v2. Any feedback is welcome.
>
> One of the tests was aborted at CFBOT [1] with:
> [09:47:00.735] dumping /tmp/cores/postgres-11-28182.core for
> /tmp/cirrus-ci-build/build/tmp_install//usr/local/pgsql/bin/postgres
> [09:47:01.035] [New LWP 28182]

There was a bug in JIT part, here is the fix.  Thanks for taking care of
this!

-- 
Best Regards
Andy Fan

Attachment

v3-0001-shared-detoast-feature.patch

Re: Shared detoast Datum proposal

From

Andy Fan

Date:

08 January 2024, 09:52:57

Andy Fan <zhihuifan1213@163.com> writes:

>>
>> One of the tests was aborted at CFBOT [1] with:
>> [09:47:00.735] dumping /tmp/cores/postgres-11-28182.core for
>> /tmp/cirrus-ci-build/build/tmp_install//usr/local/pgsql/bin/postgres
>> [09:47:01.035] [New LWP 28182]
>
> There was a bug in JIT part, here is the fix.  Thanks for taking care of
> this!

Fixed a GCC warning in cirrus-ci, hope everything is fine now.

-- 
Best Regards
Andy Fan

Attachment

v4-0001-shared-detoast-feature.patch

Re: Shared detoast Datum proposal

From

Peter Smith

Date:

22 January 2024, 06:41:56

2024-01 Commitfest.

Hi, This patch has a CF status of "Needs Review" [1], but it seems
there were CFbot test failures last time it was run [2]. Please have a
look and post an updated version if necessary.

======
[1] https://commitfest.postgresql.org/46/4759/
[2] https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/4759

Kind Regards,
Peter Smith.

Re: Shared detoast Datum proposal

From

Andy Fan

Date:

23 January 2024, 05:44:50

Hi,

Peter Smith <smithpb2250@gmail.com> writes:

> 2024-01 Commitfest.
>
> Hi, This patch has a CF status of "Needs Review" [1], but it seems
> there were CFbot test failures last time it was run [2]. Please have a
> look and post an updated version if necessary.
>
> ======
> [1] https://commitfest.postgresql.org/46/4759/
> [2] https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/4759

v5 attached, it should fix the above issue.  This version also introduce
a data struct called bitset, which has a similar APIs like bitmapset but
have the ability to reset all bits without recycle its allocated memory,
this is important for this feature.

commit 44754fb03accb0dec9710a962a334ee73eba3c49 (HEAD -> shared_detoast_value_v2)
Author: yizhi.fzh <yizhi.fzh@alibaba-inc.com>
Date:   Tue Jan 23 13:38:34 2024 +0800

    shared detoast feature.

commit 14a6eafef9ff4926b8b877d694de476657deee8a
Author: yizhi.fzh <yizhi.fzh@alibaba-inc.com>
Date:   Mon Jan 22 15:48:33 2024 +0800

    Introduce a Bitset data struct.

    While Bitmapset is designed for variable-length of bits, Bitset is
    designed for fixed-length of bits, the fixed length must be specified at
    the bitset_init stage and keep unchanged at the whole lifespan. Because
    of this, some operations on Bitset is simpler than Bitmapset.

    The bitset_clear unsets all the bits but kept the allocated memory, this
    capacity is impossible for bit Bitmapset for some solid reasons and this
    is the main reason to add this data struct.

    Also for performance aspect, the functions for Bitset removed some
    unlikely checks, instead with some Asserts.

    [1] https://postgr.es/m/CAApHDvpdp9LyAoMXvS7iCX-t3VonQM3fTWCmhconEvORrQ%2BZYA%40mail.gmail.com
    [2] https://postgr.es/m/875xzqxbv5.fsf%40163.com

I didn't write a good commit message for commit 2, the people who is
interested with this can see the first message in this thread for
explaination. 

I think anyone whose customer uses lots of jsonb probably can get
benefits from this. the precondition is the toast value should be
accessed 1+ times, including the jsonb_out function. I think this would
be not rare to happen.

-- 
Best Regards
Andy Fan

Re: Shared detoast Datum proposal

From

Michael Zhilin

Date:

23 January 2024, 08:59:00

Hi Andy,

It looks like v5 is missing in your mail. Could you please check and resend it?

Thanks,
Michael.

On 1/23/24 08:44, Andy Fan wrote:

Hi,

Peter Smith <smithpb2250@gmail.com> writes:

2024-01 Commitfest.

Hi, This patch has a CF status of "Needs Review" [1], but it seems
there were CFbot test failures last time it was run [2]. Please have a
look and post an updated version if necessary.

======
[1] https://commitfest.postgresql.org/46/4759/
[2] https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/4759

v5 attached, it should fix the above issue.  This version also introduce
a data struct called bitset, which has a similar APIs like bitmapset but
have the ability to reset all bits without recycle its allocated memory,
this is important for this feature.

commit 44754fb03accb0dec9710a962a334ee73eba3c49 (HEAD -> shared_detoast_value_v2)
Author: yizhi.fzh <yizhi.fzh@alibaba-inc.com>
Date:   Tue Jan 23 13:38:34 2024 +0800
    shared detoast feature.

commit 14a6eafef9ff4926b8b877d694de476657deee8a
Author: yizhi.fzh <yizhi.fzh@alibaba-inc.com>
Date:   Mon Jan 22 15:48:33 2024 +0800
    Introduce a Bitset data struct.        While Bitmapset is designed for variable-length of bits, Bitset is    designed for fixed-length of bits, the fixed length must be specified at    the bitset_init stage and keep unchanged at the whole lifespan. Because    of this, some operations on Bitset is simpler than Bitmapset.        The bitset_clear unsets all the bits but kept the allocated memory, this    capacity is impossible for bit Bitmapset for some solid reasons and this    is the main reason to add this data struct.        Also for performance aspect, the functions for Bitset removed some    unlikely checks, instead with some Asserts.        [1] https://postgr.es/m/CAApHDvpdp9LyAoMXvS7iCX-t3VonQM3fTWCmhconEvORrQ%2BZYA%40mail.gmail.com    [2] https://postgr.es/m/875xzqxbv5.fsf%40163.com


I didn't write a good commit message for commit 2, the people who is
interested with this can see the first message in this thread for
explaination. 

I think anyone whose customer uses lots of jsonb probably can get
benefits from this. the precondition is the toast value should be
accessed 1+ times, including the jsonb_out function. I think this would
be not rare to happen.

-- 
Michael Zhilin
Postgres Professional
+7(925)3366270
https://www.postgrespro.ru

Re: Shared detoast Datum proposal

From

Andy Fan

Date:

23 January 2024, 19:18:18

Michael Zhilin <m.zhilin@postgrespro.ru> writes:

> Hi Andy,
>
> It looks like v5 is missing in your mail. Could you please check and resend it?

ha, yes.. v5 is really attached this time.

commit eee0b2058f912d0d56282711c5d88bc0b1b75c2f (HEAD -> shared_detoast_value_v3)
Author: yizhi.fzh <yizhi.fzh@alibaba-inc.com>
Date:   Tue Jan 23 13:38:34 2024 +0800

    shared detoast feature.
    
    details at https://postgr.es/m/87il4jrk1l.fsf%40163.com

commit eeca405f5ae87e7d4e5496de989ac7b5173bcaa9
Author: yizhi.fzh <yizhi.fzh@alibaba-inc.com>
Date:   Mon Jan 22 15:48:33 2024 +0800

    Introduce a Bitset data struct.
    
    While Bitmapset is designed for variable-length of bits, Bitset is
    designed for fixed-length of bits, the fixed length must be specified at
    the bitset_init stage and keep unchanged at the whole lifespan. Because
    of this, some operations on Bitset is simpler than Bitmapset.
    
    The bitset_clear unsets all the bits but kept the allocated memory, this
    capacity is impossible for bit Bitmapset for some solid reasons and this
    is the main reason to add this data struct.
    
    Also for performance aspect, the functions for Bitset removed some
    unlikely checks, instead with some Asserts.
    
    [1] https://postgr.es/m/CAApHDvpdp9LyAoMXvS7iCX-t3VonQM3fTWCmhconEvORrQ%2BZYA%40mail.gmail.com
    [2] https://postgr.es/m/875xzqxbv5.fsf%40163.com


As for the commit "Introduce a Bitset data struct.", the test coverage
is 100% now. So it would be great that people can review this first. 

-- 
Best Regards
Andy Fan

Attachment

Re: Shared detoast Datum proposal

From

Andy Fan

Date:

20 February 2024, 06:20:36

Hi,

I didn't another round of self-review.  Comments, variable names, the
order of function definition are improved so that it can be read as
smooth as possible. so v6 attached.

-- 
Best Regards
Andy Fan

Attachment

Re: Shared detoast Datum proposal

From

Tomas Vondra

Date:

20 February 2024, 18:15:34

Hi,

I took a quick look on this thread/patch today, so let me share a couple
initial thoughts. I may not have a particularly coherent/consistent
opinion on the patch or what would be a better way to do this yet, but
perhaps it'll start a discussion ...


The goal of the patch (as I understand it) is essentially to cache
detoasted values, so that the value does not need to be detoasted
repeatedly in different parts of the plan. I think that's perfectly
sensible and worthwhile goal - detoasting is not cheap, and complex
plans may easily spend a lot of time on it.

That being said, the approach seems somewhat invasive, and touching
parts I wouldn't have expected to need a change to implement this. For
example, I certainly would not have guessed the patch to need changes in
createplan.c or setrefs.c.

Perhaps it really needs to do these things, but neither the thread nor
the comments are very enlightening as for why it's needed :-( In many
cases I can guess, but I'm not sure my guess is correct. And comments in
code generally describe what's happening locally / next line, not the
bigger picture & why it's happening.

IIUC we walk the plan to decide which Vars should be detoasted (and
cached) once, and which cases should not do that because it'd inflate
the amount of data we need to keep in a Sort, Hash etc. Not sure if
there's a better way to do this - it depends on what happens in the
upper parts of the plan, so we can't decide while building the paths.

But maybe we could decide this while transforming the paths into a plan?
(I realize the JIT thread nearby needs to do something like that in
create_plan, and in that one I suggested maybe walking the plan would be
a better approach, so I may be contradicting myself a little bit.).

In any case, set_plan_forbid_pre_detoast_vars_recurse should probably
explain the overall strategy / reasoning in a bit more detail. Maybe
it's somewhere in this thread, but that's not great for reviewers.

Similar for the setrefs.c changes. It seems a bit suspicious to piggy
back the new code into fix_scan_expr/fix_scan_list and similar code.
Those functions have a pretty clearly defined purpose, not sure we want
to also extend them to also deal with this new thing. (FWIW I'd 100%%
did it this way if I hacked on a PoC of this, to make it work. But I'm
not sure it's the right solution.)

I don't know what to thing about the Bitset - maybe it's necessary, but
how would I know? I don't have any way to measure the benefits, because
the 0002 patch uses it right away. I think it should be done the other
way around, i.e. the patch should introduce the main feature first
(using the traditional Bitmapset), and then add Bitset on top of that.
That way we could easily measure the impact and see if it's useful.

On the whole, my biggest concern is memory usage & leaks. It's not
difficult to already have problems with large detoasted values, and if
we start keeping more of them, that may get worse. Or at least that's my
intuition - it can't really get better by keeping the values longer, right?

The other thing is the risk of leaks (in the sense of keeping detoasted
values longer than expected). I see the values are allocated in
tts_mcxt, and maybe that's the right solution - not sure.


FWIW while looking at the patch, I couldn't help but to think about
expanded datums. There's similarity in what these two features do - keep
detoasted values for a while, so that we don't need to do the expensive
processing if we access them repeatedly. Of course, expanded datums are
not meant to be long-lived, while "shared detoasted values" are meant to
exist (potentially) for the query duration. But maybe there's something
we could learn from expanded datums? For example how the varlena pointer
is leveraged to point to the expanded object.

For example, what if we add a "TOAST cache" as a query-level hash table,
and modify the detoasting to first check the hash table (with the TOAST
pointer as a key)? It'd be fairly trivial to enforce a memory limit on
the hash table, evict values from it, etc. And it wouldn't require any
of the createplan/setrefs changes, I think ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Shared detoast Datum proposal

From

Andy Fan

Date:

20 February 2024, 18:38:08

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

Hi Tomas, 
>
> I took a quick look on this thread/patch today, so let me share a couple
> initial thoughts. I may not have a particularly coherent/consistent
> opinion on the patch or what would be a better way to do this yet, but
> perhaps it'll start a discussion ...

Thank you for this!

>
> The goal of the patch (as I understand it) is essentially to cache
> detoasted values, so that the value does not need to be detoasted
> repeatedly in different parts of the plan. I think that's perfectly
> sensible and worthwhile goal - detoasting is not cheap, and complex
> plans may easily spend a lot of time on it.

exactly.

>
> That being said, the approach seems somewhat invasive, and touching
> parts I wouldn't have expected to need a change to implement this. For
> example, I certainly would not have guessed the patch to need changes in
> createplan.c or setrefs.c.
>
> Perhaps it really needs to do these things, but neither the thread nor
> the comments are very enlightening as for why it's needed :-( In many
> cases I can guess, but I'm not sure my guess is correct. And comments in
> code generally describe what's happening locally / next line, not the
> bigger picture & why it's happening.

there were explaination at [1], but it probably is too high level.
Writing a proper comments is challenging for me, but I am pretty happy
to try more. At the end of this writing, I explained the data workflow,
I am feeling that would be useful for reviewers. 

> IIUC we walk the plan to decide which Vars should be detoasted (and
> cached) once, and which cases should not do that because it'd inflate
> the amount of data we need to keep in a Sort, Hash etc.

Exactly.

> Not sure if
> there's a better way to do this - it depends on what happens in the
> upper parts of the plan, so we can't decide while building the paths.

I'd say I did this intentionally. Deciding such things in paths will be
more expensive than create_plan stage IMO.

> But maybe we could decide this while transforming the paths into a plan?
> (I realize the JIT thread nearby needs to do something like that in
> create_plan, and in that one I suggested maybe walking the plan would be
> a better approach, so I may be contradicting myself a little bit.).

I think that's pretty similar what I'm doing now. Just that I did it
*just after* the create_plan. This is because the create_plan doesn't
transform the path to plan in the top->down manner all the time, the
known exception is create_mergejoin_plan. so I have to walk just after
the create_plan is 
done.

In the create_mergejoin_plan, the Sort node is created *after* the
subplan for the Sort is created.

/* Recursively process the path tree, demanding the correct tlist result */
plan = create_plan_recurse(root, best_path, CP_EXACT_TLIST);

+ /*
+  * After the plan tree is built completed, we start to walk for which
+  * expressions should not used the shared-detoast feature.
+  */
+ set_plan_forbid_pre_detoast_vars_recurse(plan, NIL);

>
> In any case, set_plan_forbid_pre_detoast_vars_recurse should probably
> explain the overall strategy / reasoning in a bit more detail. Maybe
> it's somewhere in this thread, but that's not great for reviewers.

a lession learnt, thanks.

a revisted version of comments from the lastest patch.

/*
 * set_plan_forbid_pre_detoast_vars_recurse
 *     Walking the Plan tree in the top-down manner to gather the vars which
 * should be as small as possible and record them in Plan.forbid_pre_detoast_vars
 * 
 * plan: the plan node to walk right now.
 * small_tlist: a list of nodes which its subplan should provide them as
 * small as possible.
 */
static void
set_plan_forbid_pre_detoast_vars_recurse(Plan *plan, List *small_tlist)

>
> Similar for the setrefs.c changes. It seems a bit suspicious to piggy
> back the new code into fix_scan_expr/fix_scan_list and similar code.
> Those functions have a pretty clearly defined purpose, not sure we want
> to also extend them to also deal with this new thing. (FWIW I'd 100%%
> did it this way if I hacked on a PoC of this, to make it work. But I'm
> not sure it's the right solution.)

The main reason of doing so is because I want to share the same walk
effort as fix_scan_expr. otherwise I have to walk the plan for 
every expression again. I thought this as a best practice in the past
and thought we can treat the pre_detoast_attrs as a valuable side
effects:(

> I don't know what to thing about the Bitset - maybe it's necessary, but
> how would I know? I don't have any way to measure the benefits, because
> the 0002 patch uses it right away. 

a revisted version of comments from the latest patch. graph 2 explains
this decision.

    /*
     * The attributes whose values are the detoasted version in tts_values[*],
     * if so these memory needs some extra clean-up. These memory can't be put
     * into ecxt_per_tuple_memory since many of them needs a longer life span,
     * for example the Datum in outer join. These memory is put into
     * TupleTableSlot.tts_mcxt and be clear whenever the tts_values[*] is
     * invalidated.
     *
     * Bitset rather than Bitmapset is chosen here because when all the members
     * of Bitmapset are deleted, the allocated memory will be deallocated
     * automatically, which is too expensive in this case since we need to
     * deleted all the members in each ExecClearTuple and repopulate it again
     * when fill the detoast datum to tts_values[*]. This situation will be run
     * again and again in an execution cycle.
     * 
     * These values are populated by EEOP_{INNER/OUTER/SCAN}_VAR_TOAST steps.
     */
    Bitset       *pre_detoasted_attrs;

> I think it should be done the other
> way around, i.e. the patch should introduce the main feature first
> (using the traditional Bitmapset), and then add Bitset on top of that.
> That way we could easily measure the impact and see if it's useful.

Acutally v4 used the Bitmapset, and then both perf and pgbench's tps
indicate it is too expensive. and after talk with David at [2], I
introduced bitset and use it here. the test case I used comes from [1].
IRCC, there were 5% performance difference because of this.

create table w(a int, b numeric);
insert into w select i, i from generate_series(1, 1000000)i;
select b from w where b > 0;

To reproduce the difference, we can replace the bitset_clear() with

bitset_free(slot->pre_detoasted_attrs);
slot->pre_detoasted_attrs = bitset_init(slot->tts_tupleDescriptor->natts);

in ExecFreePreDetoastDatum. then it works same as Bitmapset.

> On the whole, my biggest concern is memory usage & leaks. It's not
> difficult to already have problems with large detoasted values, and if
> we start keeping more of them, that may get worse. Or at least that's my
> intuition - it can't really get better by keeping the values longer, right?
>
> The other thing is the risk of leaks (in the sense of keeping detoasted
> values longer than expected). I see the values are allocated in
> tts_mcxt, and maybe that's the right solution - not sure.

about the memory usage, first it is kept as the same lifesplan as the
tts_values[*] which can be released pretty quickly, only if the certain
values of the tuples is not needed. it is true that we keep the detoast
version longer than before,  but that's something we have to pay I
think. 

Leaks may happen since tts_mcxt is reset at the end of *executor*. So if
we forget to release the memory when the tts_values[*] is invalidated
somehow, the memory will be leaked until the end of executor. I think
that will be enough to cause an issue. Currently besides I release such
memory at the ExecClearTuple, I also relase such memory whenever we set
tts_nvalid to 0, the theory used here is:

/*
 * tts_values is treated invalidated since tts_nvalid is set to 0, so
 * let's free the pre-detoast datum.
 */
ExecFreePreDetoastDatum(slot);

I will do more test on the memory leak stuff, since there are so many
operation aginst slot like ExecCopySlot etc, I don't know how to test it
fully. the method in my mind now is use TPCH with 10GB data size, and
monitor the query runtime memory usage.

> FWIW while looking at the patch, I couldn't help but to think about
> expanded datums. There's similarity in what these two features do - keep
> detoasted values for a while, so that we don't need to do the expensive
> processing if we access them repeatedly.

Could you provide some keyword or function names for the expanded datum
here, I probably miss this.

> Of course, expanded datums are
> not meant to be long-lived, while "shared detoasted values" are meant to
> exist (potentially) for the query duration.

hmm, acutally the "shared detoast value" just live in the
TupleTableSlot->tts_values[*], rather than the whole query duration. The
simple case is:

SELECT * FROM t WHERE a_text LIKE 'abc%';

when we scan to the next tuple, the detoast value for the previous tuple
will be relased. 

> But maybe there's something
> we could learn from expanded datums? For example how the varlena pointer
> is leveraged to point to the expanded object.

maybe. currently I just use detoast_attr to get the desired version. I'm
pleasure if we have more effective way.

if (!slot->tts_isnull[attnum] &&
    VARATT_IS_EXTENDED(slot->tts_values[attnum]))
{
    Datum        oldDatum;
    MemoryContext old = MemoryContextSwitchTo(slot->tts_mcxt);

    oldDatum = slot->tts_values[attnum];
    slot->tts_values[attnum] = PointerGetDatum(detoast_attr(
   (struct varlena *) oldDatum));

    Assert(slot->tts_nvalid > attnum);
    Assert(oldDatum != slot->tts_values[attnum]);
    bitset_add_member(slot->pre_detoasted_attrs, attnum);
    MemoryContextSwitchTo(old);
}

> For example, what if we add a "TOAST cache" as a query-level hash table,
> and modify the detoasting to first check the hash table (with the TOAST
> pointer as a key)? It'd be fairly trivial to enforce a memory limit on
> the hash table, evict values from it, etc. And it wouldn't require any
> of the createplan/setrefs changes, I think ...

Hmm, I am not sure I understand you correctly at this part. In the
current patch, to avoid the run-time (ExecExprInterp) check if we
should detoast and save the datum, I defined 3 extra steps so that
the *extra check itself* is not needed for unnecessary attributes.
for example an datum for int or a detoast datum should not be saved back
to tts_values[*] due to the small_tlist reason. However these steps can
be generated is based on the output of createplan/setrefs changes. take
the INNER_VAR for example: 

In ExecInitExprRec:

switch (variable->varno)
{
    case INNER_VAR:
            if (is_join_plan(plan) &&
            bms_is_member(attnum,
              ((JoinState *) state->parent)->inner_pre_detoast_attrs))
        {
            scratch.opcode = EEOP_INNER_VAR_TOAST;
        }
        else
        {
            scratch.opcode = EEOP_INNER_VAR;
        }
}

The data workflow is:

1. set_plan_forbid_pre_detoast_vars_recurse (in the createplan.c)
decides which Vars should *not* be pre_detoasted because of small_tlist
reason and record it in Plan.forbid_pre_detoast_vars.

2. fix_scan_expr (in the setrefs.c) tracks which Vars should be
detoasted for the specific plan node and record them in it. Currently
only Scan and Join nodes support this feature.

typedef struct Scan
{
        ...
    /*
     * Records of var's varattno - 1 where the Var is accessed indirectly by
     * any expression, like a > 3.  However a IS [NOT] NULL is not included
     * since it doesn't access the tts_values[*] at all.
     *
     * This is a essential information to figure out which attrs should use
     * the pre-detoast-attrs logic.
     */
    Bitmapset  *reference_attrs;
} Scan;

typedef struct Join
{
..
    /*
     * Records of var's varattno - 1 where the Var is accessed indirectly by
     * any expression, like a > 3.  However a IS [NOT] NULL is not included
     * since it doesn't access the tts_values[*] at all.
     *
     * This is a essential information to figure out which attrs should use
     * the pre-detoast-attrs logic.
     */
    Bitmapset  *outer_reference_attrs;
    Bitmapset  *inner_reference_attrs;
} Join;

3. during the InitPlan stage, we maintain the
PlanState.xxx_pre_detoast_attrs and generated different StepOp for them.

4. At the ExecExprInterp stage, only the new StepOp do the extra check
to see if the detoast should happen. Other steps doesn't need this
check at all. 

If we avoid the createplan/setref.c changes, probabaly some unrelated
StepOp needs the extra check as well?

When I worked with the UniqueKey feature, I maintained a
UniqueKey.README to summaried all the dicussed topics in threads, the
README is designed to save the effort for more reviewer, I think I
should apply the same logic for this feature.

Thank you very much for your feedback!

v7 attached, just some comments and Assert changes. 

[1] https://www.postgresql.org/message-id/87il4jrk1l.fsf%40163.com
[2]
https://www.postgresql.org/message-id/CAApHDvpdp9LyAoMXvS7iCX-t3VonQM3fTWCmhconEvORrQ%2BZYA%40mail.gmail.com

-- 
Best Regards
Andy Fan

Attachment

Re: Shared detoast Datum proposal

From

Tomas Vondra

Date:

21 February 2024, 12:39:29

On 2/20/24 19:38, Andy Fan wrote:
> 
> ...
>
>> I think it should be done the other
>> way around, i.e. the patch should introduce the main feature first
>> (using the traditional Bitmapset), and then add Bitset on top of that.
>> That way we could easily measure the impact and see if it's useful.
> 
> Acutally v4 used the Bitmapset, and then both perf and pgbench's tps
> indicate it is too expensive. and after talk with David at [2], I
> introduced bitset and use it here. the test case I used comes from [1].
> IRCC, there were 5% performance difference because of this.
> 
> create table w(a int, b numeric);
> insert into w select i, i from generate_series(1, 1000000)i;
> select b from w where b > 0;
> 
> To reproduce the difference, we can replace the bitset_clear() with
> 
> bitset_free(slot->pre_detoasted_attrs);
> slot->pre_detoasted_attrs = bitset_init(slot->tts_tupleDescriptor->natts);
> 
> in ExecFreePreDetoastDatum. then it works same as Bitmapset.
> 

I understand the bitset was not introduced until v5, after noticing the
bitmapset is not quite efficient. But I still think the patches should
be the other way around, i.e. the main feature first, then the bitset as
an optimization.

That allows everyone to observe the improvement on their own (without
having to tweak the code), and it also doesn't require commit of the
bitset part before it gets actually used by anything.

> 
>> On the whole, my biggest concern is memory usage & leaks. It's not
>> difficult to already have problems with large detoasted values, and if
>> we start keeping more of them, that may get worse. Or at least that's my
>> intuition - it can't really get better by keeping the values longer, right?
>>
>> The other thing is the risk of leaks (in the sense of keeping detoasted
>> values longer than expected). I see the values are allocated in
>> tts_mcxt, and maybe that's the right solution - not sure.
> 
> about the memory usage, first it is kept as the same lifesplan as the
> tts_values[*] which can be released pretty quickly, only if the certain
> values of the tuples is not needed. it is true that we keep the detoast
> version longer than before,  but that's something we have to pay I
> think. 
> 
> Leaks may happen since tts_mcxt is reset at the end of *executor*. So if
> we forget to release the memory when the tts_values[*] is invalidated
> somehow, the memory will be leaked until the end of executor. I think
> that will be enough to cause an issue. Currently besides I release such
> memory at the ExecClearTuple, I also relase such memory whenever we set
> tts_nvalid to 0, the theory used here is:
> 
> /*
>  * tts_values is treated invalidated since tts_nvalid is set to 0, so
>  * let's free the pre-detoast datum.
>  */
> ExecFreePreDetoastDatum(slot);
> 
> I will do more test on the memory leak stuff, since there are so many
> operation aginst slot like ExecCopySlot etc, I don't know how to test it
> fully. the method in my mind now is use TPCH with 10GB data size, and
> monitor the query runtime memory usage.
> 

I think this is exactly the "high level design" description that should
be in a comment, somewhere.

> 
>> FWIW while looking at the patch, I couldn't help but to think about
>> expanded datums. There's similarity in what these two features do - keep
>> detoasted values for a while, so that we don't need to do the expensive
>> processing if we access them repeatedly.
> 
> Could you provide some keyword or function names for the expanded datum
> here, I probably miss this.
> 

see src/include/utils/expandeddatum.h

>> Of course, expanded datums are
>> not meant to be long-lived, while "shared detoasted values" are meant to
>> exist (potentially) for the query duration.
> 
> hmm, acutally the "shared detoast value" just live in the
> TupleTableSlot->tts_values[*], rather than the whole query duration. The
> simple case is:
> 
> SELECT * FROM t WHERE a_text LIKE 'abc%';
> 
> when we scan to the next tuple, the detoast value for the previous tuple
> will be relased.
> 

But if the (detoasted) value is passed to the next executor node, it'll
be kept, right?

>> But maybe there's something
>> we could learn from expanded datums? For example how the varlena pointer
>> is leveraged to point to the expanded object.
> 
> maybe. currently I just use detoast_attr to get the desired version. I'm
> pleasure if we have more effective way.
> 
> if (!slot->tts_isnull[attnum] &&
>     VARATT_IS_EXTENDED(slot->tts_values[attnum]))
> {
>     Datum        oldDatum;
>     MemoryContext old = MemoryContextSwitchTo(slot->tts_mcxt);
>         
>     oldDatum = slot->tts_values[attnum];
>     slot->tts_values[attnum] = PointerGetDatum(detoast_attr(
     (struct varlena *) oldDatum));
 
>     Assert(slot->tts_nvalid > attnum);
>     Assert(oldDatum != slot->tts_values[attnum]);
>     bitset_add_member(slot->pre_detoasted_attrs, attnum);
>     MemoryContextSwitchTo(old);
> }
> 

Right. FWIW I'm not sure if expanded objects are similar enough to be a
useful inspiration.

Unrelated question - could this actually end up being too aggressive?
That is, could it detoast attributes that we end up not needing? For
example, what if the tuple gets eliminated for some reason (e.g. because
of a restriction on the table, or not having a match in a join)? Won't
we detoast the tuple only to throw it away?

> 
>> For example, what if we add a "TOAST cache" as a query-level hash table,
>> and modify the detoasting to first check the hash table (with the TOAST
>> pointer as a key)? It'd be fairly trivial to enforce a memory limit on
>> the hash table, evict values from it, etc. And it wouldn't require any
>> of the createplan/setrefs changes, I think ...
> 
> Hmm, I am not sure I understand you correctly at this part. In the
> current patch, to avoid the run-time (ExecExprInterp) check if we
> should detoast and save the datum, I defined 3 extra steps so that
> the *extra check itself* is not needed for unnecessary attributes.
> for example an datum for int or a detoast datum should not be saved back
> to tts_values[*] due to the small_tlist reason. However these steps can
> be generated is based on the output of createplan/setrefs changes. take
> the INNER_VAR for example: 
> 
> In ExecInitExprRec:
> 
> switch (variable->varno)
> {
>     case INNER_VAR:
>             if (is_join_plan(plan) &&
>             bms_is_member(attnum,
>               ((JoinState *) state->parent)->inner_pre_detoast_attrs))
>         {
>             scratch.opcode = EEOP_INNER_VAR_TOAST;
>         }
>         else
>         {
>             scratch.opcode = EEOP_INNER_VAR;
>         }
> }
> 
> The data workflow is:
> 
> 1. set_plan_forbid_pre_detoast_vars_recurse (in the createplan.c)
> decides which Vars should *not* be pre_detoasted because of small_tlist
> reason and record it in Plan.forbid_pre_detoast_vars.
> 
> 2. fix_scan_expr (in the setrefs.c) tracks which Vars should be
> detoasted for the specific plan node and record them in it. Currently
> only Scan and Join nodes support this feature.
> 
> typedef struct Scan
> {
>         ...
>     /*
>      * Records of var's varattno - 1 where the Var is accessed indirectly by
>      * any expression, like a > 3.  However a IS [NOT] NULL is not included
>      * since it doesn't access the tts_values[*] at all.
>      *
>      * This is a essential information to figure out which attrs should use
>      * the pre-detoast-attrs logic.
>      */
>     Bitmapset  *reference_attrs;
> } Scan;
> 
> typedef struct Join
> {
> ..
>     /*
>      * Records of var's varattno - 1 where the Var is accessed indirectly by
>      * any expression, like a > 3.  However a IS [NOT] NULL is not included
>      * since it doesn't access the tts_values[*] at all.
>      *
>      * This is a essential information to figure out which attrs should use
>      * the pre-detoast-attrs logic.
>      */
>     Bitmapset  *outer_reference_attrs;
>     Bitmapset  *inner_reference_attrs;
> } Join;
> 

Is it actually necessary to add new fields to these nodes? Also, the
names are not particularly descriptive of the purpose - it'd be better
to have "detoast" in the name, instead of generic "reference".


> 3. during the InitPlan stage, we maintain the
> PlanState.xxx_pre_detoast_attrs and generated different StepOp for them.
> 
> 4. At the ExecExprInterp stage, only the new StepOp do the extra check
> to see if the detoast should happen. Other steps doesn't need this
> check at all. 
> 
> If we avoid the createplan/setref.c changes, probabaly some unrelated
> StepOp needs the extra check as well?
> 
> When I worked with the UniqueKey feature, I maintained a
> UniqueKey.README to summaried all the dicussed topics in threads, the
> README is designed to save the effort for more reviewer, I think I
> should apply the same logic for this feature.
> 

Good idea. Either that (a separate README), or a comment in a header of
some suitable .c/.h file (I prefer that, because that's kinda obvious
when reading the code, I often not notice a README exists next to it).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Shared detoast Datum proposal

From

Andy Fan

Date:

21 February 2024, 13:20:38


I see your reply when I started to write a more high level
document. Thanks for the step by step help!

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

> On 2/20/24 19:38, Andy Fan wrote:
>> 
>> ...
>>
>>> I think it should be done the other
>>> way around, i.e. the patch should introduce the main feature first
>>> (using the traditional Bitmapset), and then add Bitset on top of that.
>>> That way we could easily measure the impact and see if it's useful.
>> 
>> Acutally v4 used the Bitmapset, and then both perf and pgbench's tps
>> indicate it is too expensive. and after talk with David at [2], I
>> introduced bitset and use it here. the test case I used comes from [1].
>> IRCC, there were 5% performance difference because of this.
>> 
>> create table w(a int, b numeric);
>> insert into w select i, i from generate_series(1, 1000000)i;
>> select b from w where b > 0;
>> 
>> To reproduce the difference, we can replace the bitset_clear() with
>> 
>> bitset_free(slot->pre_detoasted_attrs);
>> slot->pre_detoasted_attrs = bitset_init(slot->tts_tupleDescriptor->natts);
>> 
>> in ExecFreePreDetoastDatum. then it works same as Bitmapset.
>> 
>
> I understand the bitset was not introduced until v5, after noticing the
> bitmapset is not quite efficient. But I still think the patches should
> be the other way around, i.e. the main feature first, then the bitset as
> an optimization.
>
> That allows everyone to observe the improvement on their own (without
> having to tweak the code), and it also doesn't require commit of the
> bitset part before it gets actually used by anything.

I start to think this is a better way rather than the opposite. The next
version will be:

0001: shared detoast datum feature with high level introduction.
0002: introduce bitset and use it shared-detoast-datum feature, with the
test case to show the improvement. 

>> I will do more test on the memory leak stuff, since there are so many
>> operation aginst slot like ExecCopySlot etc, I don't know how to test it
>> fully. the method in my mind now is use TPCH with 10GB data size, and
>> monitor the query runtime memory usage.
>> 
>
> I think this is exactly the "high level design" description that should
> be in a comment, somewhere.

Got it.

>>> Of course, expanded datums are
>>> not meant to be long-lived, while "shared detoasted values" are meant to
>>> exist (potentially) for the query duration.
>> 
>> hmm, acutally the "shared detoast value" just live in the
>> TupleTableSlot->tts_values[*], rather than the whole query duration. The
>> simple case is:
>> 
>> SELECT * FROM t WHERE a_text LIKE 'abc%';
>> 
>> when we scan to the next tuple, the detoast value for the previous tuple
>> will be relased.
>> 
>
> But if the (detoasted) value is passed to the next executor node, it'll
> be kept, right?

Yes and only one copy for all the executor nodes.

> Unrelated question - could this actually end up being too aggressive?
> That is, could it detoast attributes that we end up not needing? For
> example, what if the tuple gets eliminated for some reason (e.g. because
> of a restriction on the table, or not having a match in a join)? Won't
> we detoast the tuple only to throw it away?

The detoast datum will have the exactly same lifespan with other
tts_values[*]. If the tuple get eliminated for any reason, those detoast
datum still exist until the slot is cleared for storing the next tuple.

>> typedef struct Join
>> {
>> ..
>>     /*
>>      * Records of var's varattno - 1 where the Var is accessed indirectly by
>>      * any expression, like a > 3.  However a IS [NOT] NULL is not included
>>      * since it doesn't access the tts_values[*] at all.
>>      *
>>      * This is a essential information to figure out which attrs should use
>>      * the pre-detoast-attrs logic.
>>      */
>>     Bitmapset  *outer_reference_attrs;
>>     Bitmapset  *inner_reference_attrs;
>> } Join;
>> 
>
> Is it actually necessary to add new fields to these nodes? Also, the
> names are not particularly descriptive of the purpose - it'd be better
> to have "detoast" in the name, instead of generic "reference".

Because of the way of the data transformation, I think we must add the
fields to keep such inforamtion. Then these information will be used
initilize the necessary information in PlanState. maybe I am having a
fixed mindset, I can't think out a way to avoid that right now.

I used 'reference' rather than detoast is because some implementaion
issues. In the createplan.c and setref.c, I can't check the atttyplen
effectively, so even a Var with int type is still hold here which may
have nothing with detoast.

>> 
>> When I worked with the UniqueKey feature, I maintained a
>> UniqueKey.README to summaried all the dicussed topics in threads, the
>> README is designed to save the effort for more reviewer, I think I
>> should apply the same logic for this feature.
>> 
>
> Good idea. Either that (a separate README), or a comment in a header of
> some suitable .c/.h file (I prefer that, because that's kinda obvious
> when reading the code, I often not notice a README exists next to it).

Great, I'd try this from tomorrow. 

-- 
Best Regards
Andy Fan

Re: Shared detoast Datum proposal

From

Andy Fan

Date:

26 February 2024, 02:04:14


Hi,

>>
>> Good idea. Either that (a separate README), or a comment in a header of
>> some suitable .c/.h file (I prefer that, because that's kinda obvious
>> when reading the code, I often not notice a README exists next to it).
>
> Great, I'd try this from tomorrow. 

I have made it.  Currently I choose README because this feature changed
createplan.c, setrefs.c, execExpr.c and execExprInterp.c, so putting the
high level design to any of them looks inappropriate. So the high level
design is here and detailed design for each steps is in the comments
around the code.  Hope this is helpful!

The problem:
-------------

In the current expression engine, a toasted datum is detoasted when
required, but the result is discarded immediately, either by pfree it
immediately or leave it for ResetExprContext. Arguments for which one to
use exists sometimes. More serious problem is detoasting is expensive,
especially for the data types like jsonb or array, which the value might
be very huge. In the blow example, the detoasting happens twice.

SELECT jb_col->'a', jb_col->'b' FROM t;

Within the shared-detoast-datum, we just need to detoast once for each
tuple, and discard it immediately when the tuple is not needed any
more. FWIW this issue may existing for small numeric, text as well
because of SHORT_TOAST feature where the toast's len using 1 byte rather
than 4 bytes.

Current Design
--------------

The high level design is let createplan.c and setref.c decide which
Vars can use this feature, and then the executor save the detoast
datum back slot->to tts_values[*] during the ExprEvalStep of
EEOP_{INNER|OUTER|SCAN}_VAR_TOAST. The reasons includes:

- The existing expression engine read datum from tts_values[*], no any
  extra work need to be done.
- Reuse the lifespan of TupleTableSlot system to manage memory. It
  is natural to think the detoast datum is a tts_value just that it is
  in a detoast format. Since we have a clear lifespan for TupleTableSlot
  already, like ExecClearTuple, ExecCopySlot et. We are easy to reuse
  them for managing the datoast datum's memory.
- The existing projection method can copy the datoasted datum (int64)
  automatically to the next node's slot, but keeping the ownership
  unchanged, so only the slot where the detoast really happen take the
  charge of it's lifespan.

So the final data change is adding the below field into TubleTableSlot.

typedef struct TupleTableSlot
{
..

    /*
     * The attributes whose values are the detoasted version in tts_values[*],
     * if so these memory needs some extra clean-up. These memory can't be put
     * into ecxt_per_tuple_memory since many of them needs a longer life
     * span.
     *
     * These memory is put into TupleTableSlot.tts_mcxt and be clear
     * whenever the tts_values[*] is invalidated.
     */
    Bitmapset    *pre_detoast_attrs;
};

Assuming which Var should use this feature has been decided in
createplan.c and setref.c already. The 3 new ExprEvalSteps
EEOP_{INNER,OUTER,SCAN}_VAR_TOAST as used. During the evaluating these
steps, the below code is used.

static inline void
ExecSlotDetoastDatum(TupleTableSlot *slot, int attnum)
{
    if (!slot->tts_isnull[attnum] &&
        VARATT_IS_EXTENDED(slot->tts_values[attnum]))
    {
        Datum        oldDatum;
        MemoryContext old = MemoryContextSwitchTo(slot->tts_mcxt);

        oldDatum = slot->tts_values[attnum];
        slot->tts_values[attnum] = PointerGetDatum(detoast_attr(
                                                                (struct varlena *) oldDatum));
        Assert(slot->tts_nvalid > attnum);
        Assert(oldDatum != slot->tts_values[attnum]);
        slot->pre_detoasted_attrs= bms_add_member(slot->pre_detoasted_attrs, attnum);
        MemoryContextSwitchTo(old);
    }
}

Since I don't want to the run-time extra check to see if is a detoast
should happen, so introducing 3 new steps.

When to free the detoast datum? It depends on when the slot's
tts_values[*] is invalidated, ExecClearTuple is the clear one, but any
TupleTableSlotOps which set the tts_nvalid = 0 tells us no one will use
the datum in tts_values[*] so it is time to release them based on
slot.pre_detoast_attrs as well.

Now comes to the createplan.c/setref.c part, which decides which Vars
should use the shared detoast feature. The guideline of this is:

1. It needs a detoast for a given expression.
2. It should not breaks the CP_SMALL_TLIST design. Since we saved the
   detoast datum back to tts_values[*], which make tuple bigger. if we
   do this blindly, it would be harmful to the ORDER / HASH style nodes.

A high level data flow is:

1. at the createplan.c, we walk the plan tree go gather the
   CP_SMALL_TLIST because of SORT/HASH style nodes, information and save
   it to Plan.forbid_pre_detoast_vars via the function
   set_plan_forbid_pre_detoast_vars_recurse.

2. at the setrefs.c, fix_{scan|join}_expr will recurse to Var for each
   expression, so it is a good time to track the attribute number and
   see if the Var is directly or indirectly accessed. Usually the
   indirectly access a Var means a detoast would happens, for 
   example an expression like a > 3. However some known expressions like
   VAR is NULL; is ignored. The output is {Scan|Join}.xxx_reference_attrs;

As a result, the final result is added into the plan node of Scan and
Join.

typedef struct Scan
{
    /*
     * Records of var's varattno - 1 where the Var is accessed indirectly by
     * any expression, like a > 3.  However a IS [NOT] NULL is not included
     * since it doesn't access the tts_values[*] at all.
     *
     * This is a essential information to figure out which attrs should use
     * the pre-detoast-attrs logic.
     */
    Bitmapset  *reference_attrs;
} Scan;

typedef struct Join
{
    ..
    /*
     * Records of var's varattno - 1 where the Var is accessed indirectly by
     * any expression, like a > 3.  However a IS [NOT] NULL is not included
     * since it doesn't access the tts_values[*] at all.
     *
     * This is a essential information to figure out which attrs should use
     * the pre-detoast-attrs logic.
     */
    Bitmapset  *outer_reference_attrs;
    Bitmapset  *inner_reference_attrs;
} Join;


Note that here I used '_reference_' rather than '_detoast_' is because
at this part, I still don't know if it is a toastable attrbiute, which
is known at the MakeTupleTableSlot stage.

3. At the InitPlan Stage, we calculate the final xxx_pre_detoast_attrs
   in ScanState & JoinState, which will be passed into expression
   engine in the ExecInitExprRec stage and EEOP_{INNER|OUTER|SCAN}
   _VAR_TOAST steps are generated finally then everything is connected
   with ExecSlotDetoastDatum!


Testing
-------

Case 1:

create table t (a numeric);
insert into t select i from generate_series(1, 100000)i;

cat 1.sql

select * from t where a > 0;

In this test, the current master run detoast twice for each datum. one
in numeric_gt,  one in numeric_out.  this feature makes the detoast once.

pgbench -f 1.sql -n postgres -T 10 -M prepared

master: 30.218 ms
patched(Bitmapset): 30.881ms


Then we can see the perf report as below:

-    7.34%     0.00%  postgres  postgres           [.] ExecSlotDetoastDatum (inlined)
   - ExecSlotDetoastDatum (inlined)
      - 3.47% bms_add_member
       - 3.06% bms_make_singleton (inlined)
        - palloc0
         1.30% AllocSetAlloc

-    5.99%     0.00%  postgres  postgres  [.] ExecFreePreDetoastDatum (inlined)
   - ExecFreePreDetoastDatum (inlined)
    2.64% bms_next_member
    1.17% bms_del_members
    0.94% AllocSetFree

One of the reasons is because Bitmapset will deallocate its memory when
all the bits are deleted due to commit 00b41463c, then we have to
allocate memory at the next time when adding a member to it. This kind
of action is executed 100000 times in the above workload.

Then I introduce bitset data struct (commit 0002) which is pretty like
the Bitmapset, but it would not deallocate the memory when all the bits
is unset. and use it in this feature (commit 0003). Then the result
became to: 28.715ms 

-    5.22%     0.00%  postgres  postgres  [.] ExecFreePreDetoastDatum (inlined)
   - ExecFreePreDetoastDatum (inlined)
      - 2.82% bitset_next_member
       1.69% bms_next_member_internal (inlined)
       0.95% bitset_next_member
    0.66% AllocSetFree

Here we can see the expensive calls are bitset_next_member on
slot->pre_detoast_attrs and pfree. if we put the detoast datum into
a dedicated memory context, then we can save the cost of
bitset_next_member since can discard all the memory in once and use
MemoryContextReset instead of AllocSetFree (commit 0004). then the
result became to 27.840ms. 

So the final result for case 1: 

master: 30.218 ms
patched(Bitmapset): 30.881ms
patched(bitset): 28.715ms
latency average(bitset + tts_value_mctx) = 27.840 ms


Big jsonbs test:

create table b(big jsonb);

insert into b select
jsonb_object_agg(x::text,
random()::text || random()::text || random()::text)
from generate_series(1,600) f(x);

insert into b select (select big from b) from generate_series(1, 1000)i;

explain analyze
select big->'1', big->'2', big->'3', big->'5', big->'10' from b;

master: 702.224 ms
patched: 133.306 ms

Memory usage test:

I run the workload of tpch scale 10 on against both master and patched
versions, the memory usage looks stable.

In progress work:

I'm still running tpc-h scale 100 to see if anything interesting
finding, that is in progress. As for the scale 10:

master: 55s
patched: 56s

The reason is q9 plan changed a bit, the real reason needs some more
time. Since this patch doesn't impact on the best path generation, so it
should not reasonble for me.

-- 
Best Regards
Andy Fan

Attachment

Re: Shared detoast Datum proposal

From

Nikita Malakhov

Date:

26 February 2024, 07:12:16

Hi!

I see this to be a very promising initiative, but some issues come into my mind.

When we store and detoast large values, say, 1Gb - that's a very likely scenario,

we have such cases from prod systems - we would end up in using a lot of shared

memory to keep these values alive, only to discard them later. Also, toasted values

are not always being used immediately and as a whole, i.e. jsonb values are fully

detoasted (we're working on this right now) to extract the smallest value from

big json, and these values are not worth keeping in memory. For text values too,

we often do not need the whole value to be detoasted and kept in memory.

What do you think?

Regards,

Nikita Malakhov

Postgres Professional

The Russian Postgres Company

https://postgrespro.ru/

Re: Shared detoast Datum proposal

From

Andy Fan

Date:

26 February 2024, 13:22:39

Hi, 


> When we store and detoast large values, say, 1Gb - that's a very likely scenario,
> we have such cases from prod systems - we would end up in using a lot of shared
> memory to keep these values alive, only to discard them later.

First the feature can make sure if the original expression will not
detoast it, this feature will not detoast it as well. Based on this, if
there is a 1Gb datum, in the past, it took 1GB memory as well, but
it may be discarded quicker than the patched version. but patched
version will *not* keep it for too long time since once the
slot->tts_values[*] is invalidated, the memory will be released
immedately.

For example:

create table t(a text, b int);
insert into t select '1-gb-length-text' from generate_series(1, 100);

select * from t where a like '%gb%';

The patched version will take 1gb extra memory only.

Are you worried about this case or some other cases? 

> Also, toasted values
> are not always being used immediately and as a whole, i.e. jsonb values are fully
> detoasted (we're working on this right now) to extract the smallest value from
> big json, and these values are not worth keeping in memory. For text values too,
> we often do not need the whole value to be detoasted.

I'm not sure how often this is true, espeically you metioned text data
type. I can't image why people acquire a piece of text and how can we
design a toasting system to fulfill such case without detoast the whole
as the first step. But for the jsonb or array data type, I think it is
more often. However if we design toasting like that, I'm not sure if it
would slow down other user case. for example detoasting the whole piece
use case. I am justing feeling both way has its user case, kind of heap
store and columna store.  

FWIW, as I shown in the test case, this feature is not only helpful for
big datum, it is also helpful for small toast datum. 

-- 
Best Regards
Andy Fan

Re: Shared detoast Datum proposal

From

Tomas Vondra

Date:

26 February 2024, 14:13:06

On 2/26/24 14:22, Andy Fan wrote:
> 
>...
>
>> Also, toasted values
>> are not always being used immediately and as a whole, i.e. jsonb values are fully
>> detoasted (we're working on this right now) to extract the smallest value from
>> big json, and these values are not worth keeping in memory. For text values too,
>> we often do not need the whole value to be detoasted.
> 
> I'm not sure how often this is true, espeically you metioned text data
> type. I can't image why people acquire a piece of text and how can we
> design a toasting system to fulfill such case without detoast the whole
> as the first step. But for the jsonb or array data type, I think it is
> more often. However if we design toasting like that, I'm not sure if it
> would slow down other user case. for example detoasting the whole piece
> use case. I am justing feeling both way has its user case, kind of heap
> store and columna store.  
> 

Any substr/starts_with call benefits from this optimization, and such
calls don't seem that uncommon. I certainly can imagine people doing
this fairly often. In any case, it's a long-standing behavior /
optimization, and I don't think we can just dismiss it this quickly.

Is there a way to identify cases that are likely to benefit from this
slicing, and disable the detoasting for them? We already disable it for
other cases, so maybe we can do this here too?

Or maybe there's a way to do the detoasting lazily, and only keep the
detoasted value when it's requesting the whole value. Or perhaps even
better, remember what part we detoasted, and maybe it'll be enough for
future requests?

I'm not sure how difficult would this be with the proposed approach,
which eagerly detoasts the value into tts_values. I think it'd be easier
to implement with the TOAST cache I briefly described ...

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Shared detoast Datum proposal

From

Nikita Malakhov

Date:

26 February 2024, 14:29:49

Hi,

Tomas, we already have a working jsonb partial detoast prototype,

and currently I'm porting it to the recent master. Due to the size

of the changes and very invasive nature it takes a lot of effort,

but it is already done. I'm also trying to make the core patch

less invasive. Actually, it is a subject for a separate topic, as soon

as I make it work on the current master we'll propose it

to the community.

Andy, thank you! I'll check the last patch set out and reply in a day or two.

Regards,

Nikita Malakhov

Postgres Professional

The Russian Postgres Company

https://postgrespro.ru/

Re: Shared detoast Datum proposal

From

Andy Fan

Date:

26 February 2024, 15:29:33

> On 2/26/24 14:22, Andy Fan wrote:
>> 
>>...
>>
>>> Also, toasted values
>>> are not always being used immediately and as a whole, i.e. jsonb values are fully
>>> detoasted (we're working on this right now) to extract the smallest value from
>>> big json, and these values are not worth keeping in memory. For text values too,
>>> we often do not need the whole value to be detoasted.
>> 
>> I'm not sure how often this is true, espeically you metioned text data
>> type. I can't image why people acquire a piece of text and how can we
>> design a toasting system to fulfill such case without detoast the whole
>> as the first step. But for the jsonb or array data type, I think it is
>> more often. However if we design toasting like that, I'm not sure if it
>> would slow down other user case. for example detoasting the whole piece
>> use case. I am justing feeling both way has its user case, kind of heap
>> store and columna store.  
>> 
>
> Any substr/starts_with call benefits from this optimization, and such
> calls don't seem that uncommon. I certainly can imagine people doing
> this fairly often.

This leads me to pay more attention to pg_detoast_datum_slice user case
which I did overlook and caused some regression in this case. Thanks!

As you said later:

> Is there a way to identify cases that are likely to benefit from this
> slicing, and disable the detoasting for them? We already disable it for
> other cases, so maybe we can do this here too?

I think the answer is yes, if our goal is to disable the whole toast for
some speical calls like substr/starts_with. In the current patch, we have:

/*
 * increase_level_for_pre_detoast
 *    Check if the given Expr could detoast a Var directly, if yes,
 * increase the level and return true. otherwise return false;
 */
static inline void
increase_level_for_pre_detoast(Node *node, intermediate_level_context *ctx)
{
    /* The following nodes is impossible to detoast a Var directly. */
    if (IsA(node, List) || IsA(node, TargetEntry) || IsA(node, NullTest))
    {
        ctx->level_added = false;
    }
    else if (IsA(node, FuncExpr) && castNode(FuncExpr, node)->funcid == F_PG_COLUMN_COMPRESSION)
    {
        /* let's not detoast first so that pg_column_compression works. */
        ctx->level_added = false;
    }

while it works, but the blacklist looks not awesome.

> In any case, it's a long-standing behavior /
> optimization, and I don't think we can just dismiss it this quickly.

I agree. So I said both have their user case. and when I say the *both*, I
mean the other method is "partial detoast prototype", which Nikita has
told me before. while I am sure it has many user case, I'm also feeling
it is complex and have many questions in my mind, but I'd like to see
the design before asking them.

> Or maybe there's a way to do the detoasting lazily, and only keep the
> detoasted value when it's requesting the whole value. Or perhaps even
> better, remember what part we detoasted, and maybe it'll be enough for
> future requests?
>
> I'm not sure how difficult would this be with the proposed approach,
> which eagerly detoasts the value into tts_values. I think it'd be
> easier to implement with the TOAST cache I briefly described ... 

I can understand the benefits of the TOAST cache, but the following
issues looks not good to me IIUC: 

1). we can't put the result to tts_values[*] since without the planner
decision, we don't know if this will break small_tlist logic. But if we
put it into the cache only, which means a cache-lookup as a overhead is
unavoidable.

2). It is hard to decide which entry should be evicted without attaching
it to the TupleTableSlot's life-cycle. then we can't grantee the entry
we keep is the entry we will reuse soon?

-- 
Best Regards
Andy Fan

Re: Shared detoast Datum proposal

From

Andy Fan

Date:

26 February 2024, 16:08:33

Nikita Malakhov <hukutoc@gmail.com> writes:

> Hi,
>
> Tomas, we already have a working jsonb partial detoast prototype,
> and currently I'm porting it to the recent master.

This is really awesome! Acutally when I talked to MySQL guys, they said
MySQL already did this and I admit it can resolve some different issues
than the current patch. Just I have many implemetion questions in my
mind. 

> Andy, thank you! I'll check the last patch set out and reply in a day
> or two.

Thank you for your attention!

-- 
Best Regards
Andy Fan

Re: Shared detoast Datum proposal

From

Nikita Malakhov

Date:

02 March 2024, 12:33:26

Hi, Andy!

Sorry for the delay, I have had long flights this week.

I've reviewed the patch set, thank you for your efforts.

I have several notes about patch set code, but first of

I'm not sure the overall approach is the best for the task.

As Tomas wrote above, the approach is very invasive

and spreads code related to detoasting among many

parts of code.

Have you considered another one - to alter pg_detoast_datum

(actually, it would be detoast_attr function) and save

detoasted datums in the detoast context derived

from the query context?

We have just enough information at this step to identify

the datum - toast relation id and value id, and could

keep links to these detoasted values in a, say, linked list

or hash table. Thus we would avoid altering the executor

code and all detoast-related code would reside within

the detoast source files?

I'd check this approach in several days and would

report on the result here.

There are also comments on the code itself, I'd write them

a bit later.

Regards,

Nikita Malakhov

Postgres Professional

The Russian Postgres Company

https://postgrespro.ru/

Re: Shared detoast Datum proposal

From

Andy Fan

Date:

03 March 2024, 01:52:16

Hi Nikita,

>
> Have you considered another one - to alter pg_detoast_datum
> (actually, it would be detoast_attr function) and save
> detoasted datums in the detoast context derived
> from the query context? 
>
> We have just enough information at this step to identify
> the datum - toast relation id and value id, and could
> keep links to these detoasted values in a, say, linked list
>  or hash table. Thus we would avoid altering the executor
> code and all detoast-related code would reside within
> the detoast source files?

I think you are talking about the way Tomas provided.  I am really
afraid that I was thought of too self-opinionated, but I do have some
concerns about this approch as I stated here [1], looks my concerns is
still not addressed, or the concerns itself are too absurd which is
really possible I think? 

[1] https://www.postgresql.org/message-id/875xyb1a6q.fsf%40163.com

-- 
Best Regards
Andy Fan

Re: Shared detoast Datum proposal

From

Andy Fan

Date:

03 March 2024, 06:10:57

Hi,

Here is a updated version, the main changes are:

1. an shared_detoast_datum.org file which shows the latest desgin and
pending items during discussion. 
2. I removed the slot->pre_detoast_attrs totally.
3. handle some pg_detoast_datum_slice use case.
4. Some implementation improvement.

commit 66c64c197a5dab97a563be5a291127e4c5d6841d (HEAD -> shared_detoast_value)
Author: yizhi.fzh <yizhi.fzh@alibaba-inc.com>
Date:   Sun Mar 3 13:48:25 2024 +0800

    shared detoast datum
    
    See the overall design & alternative design & testing in
    shared_detoast_datum.org

In the shared_detoast_datum.org, I added the alternative design part for
the idea of TOAST cache.

-- 
Best Regards
Andy Fan

Attachment

v9-0001-shared-detoast-datum.patch

Re: Shared detoast Datum proposal

From

Tomas Vondra

Date:

03 March 2024, 20:08:20

On 3/3/24 02:52, Andy Fan wrote:
> 
> Hi Nikita,
> 
>>
>> Have you considered another one - to alter pg_detoast_datum
>> (actually, it would be detoast_attr function) and save
>> detoasted datums in the detoast context derived
>> from the query context? 
>>
>> We have just enough information at this step to identify
>> the datum - toast relation id and value id, and could
>> keep links to these detoasted values in a, say, linked list
>>  or hash table. Thus we would avoid altering the executor
>> code and all detoast-related code would reside within
>> the detoast source files?
> 
> I think you are talking about the way Tomas provided.  I am really
> afraid that I was thought of too self-opinionated, but I do have some
> concerns about this approch as I stated here [1], looks my concerns is
> still not addressed, or the concerns itself are too absurd which is
> really possible I think? 
> 

I'm not sure I understand your concerns. I can't speak for others, but I
did not consider you and your proposals too self-opinionated. You did
propose a solution that you consider the right one. That's fine. People
will question that and suggest possible alternatives. That's fine too,
it's why we have this list in the first place.

FWIW I'm not somehow sure the approach I suggested is guaranteed to be
better than "your" approach. Maybe there's some issue that I missed, for
example.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Shared detoast Datum proposal

From

Tomas Vondra

Date:

03 March 2024, 20:29:42

On 2/26/24 16:29, Andy Fan wrote:
>
> ...>
> I can understand the benefits of the TOAST cache, but the following
> issues looks not good to me IIUC: 
> 
> 1). we can't put the result to tts_values[*] since without the planner
> decision, we don't know if this will break small_tlist logic. But if we
> put it into the cache only, which means a cache-lookup as a overhead is
> unavoidable.

True - if you're comparing having the detoasted value in tts_values[*]
directly with having to do a lookup in a cache, then yes, there's a bit
of an overhead.

But I think from the discussion it's clear having to detoast the value
into tts_values[*] has some weaknesses too, in particular:

- It requires decisions which attributes to detoast eagerly, which is
quite invasive (having to walk the plan, ...).

- I'm sure there will be cases where we choose to not detoast, but it
would be beneficial to detoast.

- Detoasting just the initial slices does not seem compatible with this.

IMHO the overhead of the cache lookup would be negligible compared to
the repeated detoasting of the value (which is the current baseline). I
somewhat doubt the difference compared to tts_values[*] will be even
measurable.

> 
> 2). It is hard to decide which entry should be evicted without attaching
> it to the TupleTableSlot's life-cycle. then we can't grantee the entry
> we keep is the entry we will reuse soon?
> 

True. But is that really a problem? I imagined we'd set some sort of
memory limit on the cache (work_mem?), and evict oldest entries. So the
entries would eventually get evicted, and the memory limit would ensure
we don't consume arbitrary amounts of memory.

We could also add some "slot callback" but that seems unnecessary.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Shared detoast Datum proposal

From

Tomas Vondra

Date:

03 March 2024, 20:41:18

On 3/3/24 07:10, Andy Fan wrote:
> 
> Hi,
> 
> Here is a updated version, the main changes are:
> 
> 1. an shared_detoast_datum.org file which shows the latest desgin and
> pending items during discussion. 
> 2. I removed the slot->pre_detoast_attrs totally.
> 3. handle some pg_detoast_datum_slice use case.
> 4. Some implementation improvement.
> 

I only very briefly skimmed the patch, and I guess most of my earlier
comments still apply. But I'm a bit surprised the patch needs to pass a
MemoryContext to so many places as a function argument - that seems to
go against how we work with memory contexts. Doubly so because it seems
to only ever pass CurrentMemoryContext anyway. So what's that about?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Shared detoast Datum proposal

From

Andy Fan

Date:

04 March 2024, 01:17:27

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

> On 3/3/24 02:52, Andy Fan wrote:
>> 
>> Hi Nikita,
>> 
>>>
>>> Have you considered another one - to alter pg_detoast_datum
>>> (actually, it would be detoast_attr function) and save
>>> detoasted datums in the detoast context derived
>>> from the query context? 
>>>
>>> We have just enough information at this step to identify
>>> the datum - toast relation id and value id, and could
>>> keep links to these detoasted values in a, say, linked list
>>>  or hash table. Thus we would avoid altering the executor
>>> code and all detoast-related code would reside within
>>> the detoast source files?
>> 
>> I think you are talking about the way Tomas provided.  I am really
>> afraid that I was thought of too self-opinionated, but I do have some
>> concerns about this approch as I stated here [1], looks my concerns is
>> still not addressed, or the concerns itself are too absurd which is
>> really possible I think? 
>> 
>
> I can't speak for others, but I did not consider you and your
> proposals too self-opinionated. 

This would be really great to know:)

> You did
> propose a solution that you consider the right one. That's fine. People
> will question that and suggest possible alternatives. That's fine too,
> it's why we have this list in the first place.

I agree.

-- 
Best Regards
Andy Fan

Re: Shared detoast Datum proposal

From

Andy Fan

Date:

04 March 2024, 01:23:56

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

> On 2/26/24 16:29, Andy Fan wrote:
>>
>> ...>
>> I can understand the benefits of the TOAST cache, but the following
>> issues looks not good to me IIUC: 
>> 
>> 1). we can't put the result to tts_values[*] since without the planner
>> decision, we don't know if this will break small_tlist logic. But if we
>> put it into the cache only, which means a cache-lookup as a overhead is
>> unavoidable.
>
> True - if you're comparing having the detoasted value in tts_values[*]
> directly with having to do a lookup in a cache, then yes, there's a bit
> of an overhead.
>
> But I think from the discussion it's clear having to detoast the value
> into tts_values[*] has some weaknesses too, in particular:
>
> - It requires decisions which attributes to detoast eagerly, which is
> quite invasive (having to walk the plan, ...).
>
> - I'm sure there will be cases where we choose to not detoast, but it
> would be beneficial to detoast.
>
> - Detoasting just the initial slices does not seem compatible with this.
>
> IMHO the overhead of the cache lookup would be negligible compared to
> the repeated detoasting of the value (which is the current baseline). I
> somewhat doubt the difference compared to tts_values[*] will be even
> measurable.
>
>> 
>> 2). It is hard to decide which entry should be evicted without attaching
>> it to the TupleTableSlot's life-cycle. then we can't grantee the entry
>> we keep is the entry we will reuse soon?
>> 
>
> True. But is that really a problem? I imagined we'd set some sort of
> memory limit on the cache (work_mem?), and evict oldest entries. So the
> entries would eventually get evicted, and the memory limit would ensure
> we don't consume arbitrary amounts of memory.
>

Here is a copy from the shared_detoast_datum.org in the patch. I am
feeling about when / which entry to free is a key problem and run-time
(detoast_attr) overhead vs createplan.c overhead is a small difference
as well. the overhead I paid for createplan.c/setref.c looks not huge as
well. 

"""
A alternative design: toast cache
---------------------------------

This method is provided by Tomas during the review process. IIUC, this
method would maintain a local HTAB which map a toast datum to a detoast
datum and the entry is maintained / used in detoast_attr
function. Within this method, the overall design is pretty clear and the
code modification can be controlled in toasting system only.

I assumed that releasing all of the memory at the end of executor once
is not an option since it may consumed too many memory. Then, when and
which entry to release becomes a trouble for me. For example:

          QUERY PLAN
------------------------------
 Nested Loop
   Join Filter: (t1.a = t2.a)
   ->  Seq Scan on t1
   ->  Seq Scan on t2
(4 rows)

In this case t1.a needs a longer lifespan than t2.a since it is
in outer relation. Without the help from slot's life-cycle system, I
can't think out a answer for the above question.

Another difference between the 2 methods is my method have many
modification on createplan.c/setref.c/execExpr.c/execExprInterp.c, but
it can save some run-time effort like hash_search find / enter run-time
in method 2 since I put them directly into tts_values[*].

I'm not sure the factor 2 makes some real measurable difference in real
case, so my current concern mainly comes from factor 1.
"""

-- 
Best Regards
Andy Fan

Re: Shared detoast Datum proposal

From

Andy Fan

Date:

04 March 2024, 01:29:53

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

> On 3/3/24 07:10, Andy Fan wrote:
>> 
>> Hi,
>> 
>> Here is a updated version, the main changes are:
>> 
>> 1. an shared_detoast_datum.org file which shows the latest desgin and
>> pending items during discussion. 
>> 2. I removed the slot->pre_detoast_attrs totally.
>> 3. handle some pg_detoast_datum_slice use case.
>> 4. Some implementation improvement.
>> 
>
> I only very briefly skimmed the patch, and I guess most of my earlier
> comments still apply.

Yes, the overall design is not changed.

> But I'm a bit surprised the patch needs to pass a
> MemoryContext to so many places as a function argument - that seems to
> go against how we work with memory contexts. Doubly so because it seems
> to only ever pass CurrentMemoryContext anyway. So what's that about?

I think you are talking about the argument like this:
 
 /* ----------
- * detoast_attr -
+ * detoast_attr_ext -
  *
  *    Public entry point to get back a toasted value from compression
  *    or external storage.  The result is always non-extended varlena form.
  *
+ * ctx: The memory context which the final value belongs to.
+ *
  * Note some callers assume that if the input is an EXTERNAL or COMPRESSED
  * datum, the result will be a pfree'able chunk.
  * ----------
  */

+extern struct varlena *
+detoast_attr_ext(struct varlena *attr, MemoryContext ctx)

This is mainly because 'detoast_attr' will apply more memory than the
"final detoast datum" , for example the code to scan toast relation
needs some memory as well, and what I want is just keeping the memory
for the final detoast datum so that other memory can be released sooner,
so I added the function argument for that. 

-- 
Best Regards
Andy Fan

Re: Shared detoast Datum proposal

From

Tomas Vondra

Date:

04 March 2024, 12:07:58


On 3/4/24 02:29, Andy Fan wrote:
> 
> Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
> 
>> On 3/3/24 07:10, Andy Fan wrote:
>>>
>>> Hi,
>>>
>>> Here is a updated version, the main changes are:
>>>
>>> 1. an shared_detoast_datum.org file which shows the latest desgin and
>>> pending items during discussion. 
>>> 2. I removed the slot->pre_detoast_attrs totally.
>>> 3. handle some pg_detoast_datum_slice use case.
>>> 4. Some implementation improvement.
>>>
>>
>> I only very briefly skimmed the patch, and I guess most of my earlier
>> comments still apply.
> 
> Yes, the overall design is not changed.
> 
>> But I'm a bit surprised the patch needs to pass a
>> MemoryContext to so many places as a function argument - that seems to
>> go against how we work with memory contexts. Doubly so because it seems
>> to only ever pass CurrentMemoryContext anyway. So what's that about?
> 
> I think you are talking about the argument like this:
>  
>  /* ----------
> - * detoast_attr -
> + * detoast_attr_ext -
>   *
>   *    Public entry point to get back a toasted value from compression
>   *    or external storage.  The result is always non-extended varlena form.
>   *
> + * ctx: The memory context which the final value belongs to.
> + *
>   * Note some callers assume that if the input is an EXTERNAL or COMPRESSED
>   * datum, the result will be a pfree'able chunk.
>   * ----------
>   */
> 
> +extern struct varlena *
> +detoast_attr_ext(struct varlena *attr, MemoryContext ctx)
> 
> This is mainly because 'detoast_attr' will apply more memory than the
> "final detoast datum" , for example the code to scan toast relation
> needs some memory as well, and what I want is just keeping the memory
> for the final detoast datum so that other memory can be released sooner,
> so I added the function argument for that. 
> 

What exactly does detoast_attr allocate in order to scan toast relation?
Does that happen in master, or just with the patch? If with master, I
suggest to ignore that / treat that as a separate issue and leave it for
a different patch.

In any case, the custom is to allocate results in the context that is
set in CurrentMemoryContext at the moment of the call, and if there's
substantial amount of allocations that we want to free soon, we either
do that by explicit pfree() calls, or create a small temporary context
in the function (but that's more expensive).

I don't think we should invent a new convention where the context is
passed as an argument, unless absolutely necessary.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Shared detoast Datum proposal

From

Andy Fan

Date:

04 March 2024, 12:13:20

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

>>> But I'm a bit surprised the patch needs to pass a
>>> MemoryContext to so many places as a function argument - that seems to
>>> go against how we work with memory contexts. Doubly so because it seems
>>> to only ever pass CurrentMemoryContext anyway. So what's that about?
>> 
>> I think you are talking about the argument like this:
>>  
>>  /* ----------
>> - * detoast_attr -
>> + * detoast_attr_ext -
>>   *
>>   *    Public entry point to get back a toasted value from compression
>>   *    or external storage.  The result is always non-extended varlena form.
>>   *
>> + * ctx: The memory context which the final value belongs to.
>> + *
>>   * Note some callers assume that if the input is an EXTERNAL or COMPRESSED
>>   * datum, the result will be a pfree'able chunk.
>>   * ----------
>>   */
>> 
>> +extern struct varlena *
>> +detoast_attr_ext(struct varlena *attr, MemoryContext ctx)
>> 
>> This is mainly because 'detoast_attr' will apply more memory than the
>> "final detoast datum" , for example the code to scan toast relation
>> needs some memory as well, and what I want is just keeping the memory
>> for the final detoast datum so that other memory can be released sooner,
>> so I added the function argument for that. 
>> 
>
> What exactly does detoast_attr allocate in order to scan toast relation?
> Does that happen in master, or just with the patch?

It is in the current master, for example the TupleTableSlot creation
needed by scanning the toast relation needs a memory allocating. 

> If with master, I
> suggest to ignore that / treat that as a separate issue and leave it for
> a different patch.

OK, I can make it as a seperate commit in the next version. 

> In any case, the custom is to allocate results in the context that is
> set in CurrentMemoryContext at the moment of the call, and if there's
> substantial amount of allocations that we want to free soon, we either
> do that by explicit pfree() calls, or create a small temporary context
> in the function (but that's more expensive).
>
> I don't think we should invent a new convention where the context is
> passed as an argument, unless absolutely necessary.

Hmm, in this case, if we don't add the new argument, we have to get the
detoast datum in Context-1 and copy it to the desired memory context,
which is the thing I want to avoid.  I think we have a same decision to
make on the TOAST cache method as well.

-- 
Best Regards
Andy Fan

Re: Shared detoast Datum proposal

From

Tomas Vondra

Date:

04 March 2024, 16:06:23

On 3/4/24 02:23, Andy Fan wrote:
> 
> Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
> 
>> On 2/26/24 16:29, Andy Fan wrote:
>>>
>>> ...>
>>> I can understand the benefits of the TOAST cache, but the following
>>> issues looks not good to me IIUC: 
>>>
>>> 1). we can't put the result to tts_values[*] since without the planner
>>> decision, we don't know if this will break small_tlist logic. But if we
>>> put it into the cache only, which means a cache-lookup as a overhead is
>>> unavoidable.
>>
>> True - if you're comparing having the detoasted value in tts_values[*]
>> directly with having to do a lookup in a cache, then yes, there's a bit
>> of an overhead.
>>
>> But I think from the discussion it's clear having to detoast the value
>> into tts_values[*] has some weaknesses too, in particular:
>>
>> - It requires decisions which attributes to detoast eagerly, which is
>> quite invasive (having to walk the plan, ...).
>>
>> - I'm sure there will be cases where we choose to not detoast, but it
>> would be beneficial to detoast.
>>
>> - Detoasting just the initial slices does not seem compatible with this.
>>
>> IMHO the overhead of the cache lookup would be negligible compared to
>> the repeated detoasting of the value (which is the current baseline). I
>> somewhat doubt the difference compared to tts_values[*] will be even
>> measurable.
>>
>>>
>>> 2). It is hard to decide which entry should be evicted without attaching
>>> it to the TupleTableSlot's life-cycle. then we can't grantee the entry
>>> we keep is the entry we will reuse soon?
>>>
>>
>> True. But is that really a problem? I imagined we'd set some sort of
>> memory limit on the cache (work_mem?), and evict oldest entries. So the
>> entries would eventually get evicted, and the memory limit would ensure
>> we don't consume arbitrary amounts of memory.
>>
> 
> Here is a copy from the shared_detoast_datum.org in the patch. I am
> feeling about when / which entry to free is a key problem and run-time
> (detoast_attr) overhead vs createplan.c overhead is a small difference
> as well. the overhead I paid for createplan.c/setref.c looks not huge as
> well. 

I decided to whip up a PoC patch implementing this approach, to get a
better idea of how it compares to the original proposal, both in terms
of performance and code complexity.

Attached are two patches that both add a simple cache in detoast.c, but
differ in when exactly the caching happens.

toast-cache-v1 - caching happens in toast_fetch_datum, which means this
happens before decompression

toast-cache-v2 - caching happens in detoast_attr, after decompression

I started with v1, but that did not help with the example workloads
(from the original post) at all. Only after I changed this to cache
decompressed data (in v2) it became competitive.

There's GUC to enable the cache (enable_toast_cache, off by default), to
make experimentation easier.

The cache is invalidated at the end of a transaction - I think this
should be OK, because the rows may be deleted but can't be cleaned up
(or replaced with a new TOAST value). This means the cache could cover
multiple queries - not sure if that's a good thing or bad thing.

I haven't implemented any sort of cache eviction. I agree that's a hard
problem in general, but I doubt we can do better than LRU. I also didn't
implement any memory limit.

FWIW I think it's a fairly serious issue Andy's approach does not have
any way to limit the memory. It will detoasts the values eagerly, but
what if the rows have multiple large values? What if we end up keeping
multiple such rows (from different parts of the plan) at once?

I also haven't implemented caching for sliced data. I think modifying
the code to support this should not be difficult - it'd require tracking
how much data we have in the cache, and considering that during lookup
and when adding entries to cache.

I've implemented the cache as "global". Maybe it should be tied to query
or executor state, but then it'd be harder to access it from detoast.c
(we'd have to pass the cache pointer in some way, and it wouldn't work
for use cases that don't go through executor).

I think implementation-wise this is pretty non-invasive.

Performance-wise, these patches are slower than with Andy's patch. For
example for the jsonb Q1, I see master ~500ms, Andy's patch ~100ms and
v2 at ~150ms. I'm sure there's a number of optimization opportunities
and v2 could get v2 closer to the 100ms.

v1 is not competitive at all in this jsonb/Q1 test - it's just as slow
as master, because the data set is small enough to be fully cached, so
there's no I/O and it's the decompression is the actual bottleneck.

That being said, I'm not convinced v1 would be a bad choice for some
cases. If there's memory pressure enough to evict TOAST, it's quite
possible the I/O would become the bottleneck. At which point it might be
a good trade off to cache compressed data, because then we'd cache more
detoasted values.

OTOH it's likely the access to TOAST values is localized (in temporal
sense), i.e. we read it from disk, detoast it a couple times in a short
time interval, and then move to the next row. That'd mean v2 is probably
the correct trade off.

Not sure.

> 
> """
> A alternative design: toast cache
> ---------------------------------
> 
> This method is provided by Tomas during the review process. IIUC, this
> method would maintain a local HTAB which map a toast datum to a detoast
> datum and the entry is maintained / used in detoast_attr
> function. Within this method, the overall design is pretty clear and the
> code modification can be controlled in toasting system only.
> 

Right.

> I assumed that releasing all of the memory at the end of executor once
> is not an option since it may consumed too many memory. Then, when and
> which entry to release becomes a trouble for me. For example:
> 
>           QUERY PLAN
> ------------------------------
>  Nested Loop
>    Join Filter: (t1.a = t2.a)
>    ->  Seq Scan on t1
>    ->  Seq Scan on t2
> (4 rows)
> 
> In this case t1.a needs a longer lifespan than t2.a since it is
> in outer relation. Without the help from slot's life-cycle system, I
> can't think out a answer for the above question.
> 

This is true, but how likely such plans are? I mean, surely no one would
do nested loop with sequential scans on reasonably large tables, so how
representative this example is?

Also, this leads me to the need of having some sort of memory limit. If
we may be keeping entries for extended periods of time, and we don't
have any way to limit the amount of memory, that does not seem great.

AFAIK if we detoast everything into tts_values[] there's no way to
implement and enforce such limit. What happens if there's a row with
multiple large-ish TOAST values? What happens if those rows are in
different (and distant) parts of the plan?

It seems far easier to limit the memory with the toast cache.

> Another difference between the 2 methods is my method have many
> modification on createplan.c/setref.c/execExpr.c/execExprInterp.c, but
> it can save some run-time effort like hash_search find / enter run-time
> in method 2 since I put them directly into tts_values[*].
> 
> I'm not sure the factor 2 makes some real measurable difference in real
> case, so my current concern mainly comes from factor 1.
> """
> 

This seems a bit dismissive of the (possible) issue. It'd be good to do
some measurements, especially on simple queries that can't benefit from
the detoasting (e.g. because there's no varlena attribute).

In any case, my concern is more about having to do this when creating
the plan at all, the code complexity etc. Not just because it might have
performance impact.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: Shared detoast Datum proposal

From

Andy Fan

Date:

04 March 2024, 17:08:29

>
> I decided to whip up a PoC patch implementing this approach, to get a
> better idea of how it compares to the original proposal, both in terms
> of performance and code complexity.
>
> Attached are two patches that both add a simple cache in detoast.c, but
> differ in when exactly the caching happens.
>
> toast-cache-v1 - caching happens in toast_fetch_datum, which means this
> happens before decompression
>
> toast-cache-v2 - caching happens in detoast_attr, after decompression
...
>
> I think implementation-wise this is pretty non-invasive.
..
>
> Performance-wise, these patches are slower than with Andy's patch. For
> example for the jsonb Q1, I see master ~500ms, Andy's patch ~100ms and
> v2 at ~150ms. I'm sure there's a number of optimization opportunities
> and v2 could get v2 closer to the 100ms.
>
> v1 is not competitive at all in this jsonb/Q1 test - it's just as slow
> as master, because the data set is small enough to be fully cached, so
> there's no I/O and it's the decompression is the actual bottleneck.
>
> That being said, I'm not convinced v1 would be a bad choice for some
> cases. If there's memory pressure enough to evict TOAST, it's quite
> possible the I/O would become the bottleneck. At which point it might be
> a good trade off to cache compressed data, because then we'd cache more
> detoasted values.
>
> OTOH it's likely the access to TOAST values is localized (in temporal
> sense), i.e. we read it from disk, detoast it a couple times in a short
> time interval, and then move to the next row. That'd mean v2 is probably
> the correct trade off.

I don't have different thought about the above statement and Thanks for
the PoC code which would make later testing much easier. 

>> 
>> """
>> A alternative design: toast cache
>> ---------------------------------
>> 
>> This method is provided by Tomas during the review process. IIUC, this
>> method would maintain a local HTAB which map a toast datum to a detoast
>> datum and the entry is maintained / used in detoast_attr
>> function. Within this method, the overall design is pretty clear and the
>> code modification can be controlled in toasting system only.
>> 
>
> Right.
>
>> I assumed that releasing all of the memory at the end of executor once
>> is not an option since it may consumed too many memory. Then, when and
>> which entry to release becomes a trouble for me. For example:
>> 
>>           QUERY PLAN
>> ------------------------------
>>  Nested Loop
>>    Join Filter: (t1.a = t2.a)
>>    ->  Seq Scan on t1
>>    ->  Seq Scan on t2
>> (4 rows)
>> 
>> In this case t1.a needs a longer lifespan than t2.a since it is
>> in outer relation. Without the help from slot's life-cycle system, I
>> can't think out a answer for the above question.
>> 
>
> This is true, but how likely such plans are? I mean, surely no one would
> do nested loop with sequential scans on reasonably large tables, so how
> representative this example is?

Acutally this is a simplest Join case, we still have same problem like
Nested Loop + Index Scan which will be pretty common. 

> Also, this leads me to the need of having some sort of memory limit. If
> we may be keeping entries for extended periods of time, and we don't
> have any way to limit the amount of memory, that does not seem great.
>
> AFAIK if we detoast everything into tts_values[] there's no way to
> implement and enforce such limit. What happens if there's a row with
> multiple large-ish TOAST values? What happens if those rows are in
> different (and distant) parts of the plan?

I think this can be done by tracking the memory usage on EState level
or global variable level and disable it if it exceeds the limits and
resume it when we free the detoast datum when we don't need it. I think
no other changes need to be done.

> It seems far easier to limit the memory with the toast cache.

I think the memory limit and entry eviction is the key issue now. IMO,
there are still some difference when both methods can support memory
limit. The reason is my patch can grantee the cached memory will be
reused, so if we set the limit to 10MB, we know all the 10MB is
useful, but the TOAST cache method, probably can't grantee that, then
when we want to make it effecitvely, we have to set a higher limit for
this.


>> Another difference between the 2 methods is my method have many
>> modification on createplan.c/setref.c/execExpr.c/execExprInterp.c, but
>> it can save some run-time effort like hash_search find / enter run-time
>> in method 2 since I put them directly into tts_values[*].
>> 
>> I'm not sure the factor 2 makes some real measurable difference in real
>> case, so my current concern mainly comes from factor 1.
>> """
>> 
>
> This seems a bit dismissive of the (possible) issue.

Hmm, sorry about this, that is not my intention:(

> It'd be good to do
> some measurements, especially on simple queries that can't benefit from
> the detoasting (e.g. because there's no varlena attribute).

This testing for the queries which have no varlena attribute was done at
the very begining of this thread, and for now, the code is much more
nicer for this sistuation. all the changes in execExpr.c
execExprInterp.c has no impact on this. Only the plan walker codes
matter. Here is a test based on tenk1. 

q1: explain select count(*) from tenk1 where odd > 10 and even > 30;
q2: select count(*) from tenk1 where odd > 10 and even > 30;

pgbench -n -f qx.sql regression -T10

| Query | master (ms) | patched (ms) |
|-------+-------------+--------------|
| q1    |       0.129 |        0.129 |
| q2    |       1.876 |        1.870 |

there are some error for patched-q2 combination, but at least it can
show it doesn't cause noticable regression.

> In any case, my concern is more about having to do this when creating
> the plan at all, the code complexity etc. Not just because it might have
> performance impact.

I think the main trade-off is TOAST cache method is pretty non-invasive
but can't control the eviction well, the impacts includes:
1. may evicting the datum we want and kept the datum we don't need.
2. more likely to use up all the memory which is allowed. for example:
if we set the limit to 1MB, then we kept more data which will be not
used and then consuming all of the 1MB. 

My method is resolving this with some helps from other modules (kind of
invasive) but can control the eviction well and use the memory as less
as possible.

-- 
Best Regards
Andy Fan

Re: Shared detoast Datum proposal

From

Tomas Vondra

Date:

04 March 2024, 20:07:04

On 3/4/24 18:08, Andy Fan wrote:
> ...
>>
>>> I assumed that releasing all of the memory at the end of executor once
>>> is not an option since it may consumed too many memory. Then, when and
>>> which entry to release becomes a trouble for me. For example:
>>>
>>>           QUERY PLAN
>>> ------------------------------
>>>  Nested Loop
>>>    Join Filter: (t1.a = t2.a)
>>>    ->  Seq Scan on t1
>>>    ->  Seq Scan on t2
>>> (4 rows)
>>>
>>> In this case t1.a needs a longer lifespan than t2.a since it is
>>> in outer relation. Without the help from slot's life-cycle system, I
>>> can't think out a answer for the above question.
>>>
>>
>> This is true, but how likely such plans are? I mean, surely no one would
>> do nested loop with sequential scans on reasonably large tables, so how
>> representative this example is?
> 
> Acutally this is a simplest Join case, we still have same problem like
> Nested Loop + Index Scan which will be pretty common. 
> 

Yes, I understand there are cases where LRU eviction may not be the best
choice - like here, where the "t1" should stay in the case. But there
are also cases where this is the wrong choice, and LRU would be better.

For example a couple paragraphs down you suggest to enforce the memory
limit by disabling detoasting if the memory limit is reached. That means
the detoasting can get disabled because there's a single access to the
attribute somewhere "up the plan tree". But what if the other attributes
(which now won't be detoasted) are accessed many times until then?

I think LRU is a pretty good "default" algorithm if we don't have a very
good idea of the exact life span of the values, etc. Which is why other
nodes (e.g. Memoize) use LRU too.

But I wonder if there's a way to count how many times an attribute is
accessed (or is likely to be). That might be used to inform a better
eviction strategy.

Also, we don't need to evict the whole entry - we could evict just the
data part (guaranteed to be fairly large), but keep the header, and keep
the counts, expected number of hits, and other info. And use this to
e.g. release entries that reached the expected number of hits. But I'd
probably start with LRU and only do this as an improvement later.

>> Also, this leads me to the need of having some sort of memory limit. If
>> we may be keeping entries for extended periods of time, and we don't
>> have any way to limit the amount of memory, that does not seem great.
>>
>> AFAIK if we detoast everything into tts_values[] there's no way to
>> implement and enforce such limit. What happens if there's a row with
>> multiple large-ish TOAST values? What happens if those rows are in
>> different (and distant) parts of the plan?
> 
> I think this can be done by tracking the memory usage on EState level
> or global variable level and disable it if it exceeds the limits and
> resume it when we free the detoast datum when we don't need it. I think
> no other changes need to be done.
> 

That seems like a fair amount of additional complexity. And what if the
toasted values are accessed in context without EState (I haven't checked
how common / important that is)?

And I'm not sure just disabling the detoast as a way to enforce a memory
limit, as explained earlier.

>> It seems far easier to limit the memory with the toast cache.
> 
> I think the memory limit and entry eviction is the key issue now. IMO,
> there are still some difference when both methods can support memory
> limit. The reason is my patch can grantee the cached memory will be
> reused, so if we set the limit to 10MB, we know all the 10MB is
> useful, but the TOAST cache method, probably can't grantee that, then
> when we want to make it effecitvely, we have to set a higher limit for
> this.
> 

Can it actually guarantee that? It can guarantee the slot may be used,
but I don't see how could it guarantee the detoasted value will be used.
We may be keeping the slot for other reasons. And even if it could
guarantee the detoasted value will be used, does that actually prove
it's better to keep that value? What if it's only used once, but it's
blocking detoasting of values that will be used 10x that?

If we detoast a 10MB value in the outer side of the Nest Loop, what if
the inner path has multiple accesses to another 10MB value that now
can't be detoasted (as a shared value)?

> 
>>> Another difference between the 2 methods is my method have many
>>> modification on createplan.c/setref.c/execExpr.c/execExprInterp.c, but
>>> it can save some run-time effort like hash_search find / enter run-time
>>> in method 2 since I put them directly into tts_values[*].
>>>
>>> I'm not sure the factor 2 makes some real measurable difference in real
>>> case, so my current concern mainly comes from factor 1.
>>> """
>>>
>>
>> This seems a bit dismissive of the (possible) issue.
> 
> Hmm, sorry about this, that is not my intention:(
> 

No need to apologize.

>> It'd be good to do
>> some measurements, especially on simple queries that can't benefit from
>> the detoasting (e.g. because there's no varlena attribute).
> 
> This testing for the queries which have no varlena attribute was done at
> the very begining of this thread, and for now, the code is much more
> nicer for this sistuation. all the changes in execExpr.c
> execExprInterp.c has no impact on this. Only the plan walker codes
> matter. Here is a test based on tenk1. 
> 
> q1: explain select count(*) from tenk1 where odd > 10 and even > 30;
> q2: select count(*) from tenk1 where odd > 10 and even > 30;
> 
> pgbench -n -f qx.sql regression -T10
> 
> | Query | master (ms) | patched (ms) |
> |-------+-------------+--------------|
> | q1    |       0.129 |        0.129 |
> | q2    |       1.876 |        1.870 |
> 
> there are some error for patched-q2 combination, but at least it can
> show it doesn't cause noticable regression.
> 

OK, I'll try to do some experiments with these cases too.

>> In any case, my concern is more about having to do this when creating
>> the plan at all, the code complexity etc. Not just because it might have
>> performance impact.
> 
> I think the main trade-off is TOAST cache method is pretty non-invasive
> but can't control the eviction well, the impacts includes:
> 1. may evicting the datum we want and kept the datum we don't need.

This applies to any eviction algorithm, not just LRU. Ultimately what
matters is whether we have in the cache the most often used values, i.e.
the hit ratio (perhaps in combination with how expensive detoasting that
particular entry was).

> 2. more likely to use up all the memory which is allowed. for example:
> if we set the limit to 1MB, then we kept more data which will be not
> used and then consuming all of the 1MB. 
> 
> My method is resolving this with some helps from other modules (kind of
> invasive) but can control the eviction well and use the memory as less
> as possible.
> 

Is the memory usage really an issue? Sure, it'd be nice to evict entries
as soon as we can prove they are not needed anymore, but if the cache
limit is set to 1MB it's not really a problem to use 1MB.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Shared detoast Datum proposal

From

Andy Fan

Date:

05 March 2024, 01:44:33

>
>> 2. more likely to use up all the memory which is allowed. for example:
>> if we set the limit to 1MB, then we kept more data which will be not
>> used and then consuming all of the 1MB. 
>> 
>> My method is resolving this with some helps from other modules (kind of
>> invasive) but can control the eviction well and use the memory as less
>> as possible.
>> 
>
> Is the memory usage really an issue? Sure, it'd be nice to evict entries
> as soon as we can prove they are not needed anymore, but if the cache
> limit is set to 1MB it's not really a problem to use 1MB.

This might be a key point which leads us to some different directions, so
I want to explain more about this, to see if we can get some agreement
here.

It is a bit hard to decide which memory limit to set, 1MB, 10MB or 40MB,
100MB. In my current case it is 40MB at least. Less memory limit 
causes cache ineffecitve, high memory limit cause potential memory
use-up issue in the TOAST cache design. But in my method, even we set a
higher value, it just limits the user case it really (nearly) needed,
and it would not cache more values util the limit is hit. This would
make a noticable difference when we want to set a high limit and we have
some high active sessions, like 100 * 40MB = 4GB. 

> On 3/4/24 18:08, Andy Fan wrote:
>> ...
>>>
>>>> I assumed that releasing all of the memory at the end of executor once
>>>> is not an option since it may consumed too many memory. Then, when and
>>>> which entry to release becomes a trouble for me. For example:
>>>>
>>>>           QUERY PLAN
>>>> ------------------------------
>>>>  Nested Loop
>>>>    Join Filter: (t1.a = t2.a)
>>>>    ->  Seq Scan on t1
>>>>    ->  Seq Scan on t2
>>>> (4 rows)
>>>>
>>>> In this case t1.a needs a longer lifespan than t2.a since it is
>>>> in outer relation. Without the help from slot's life-cycle system, I
>>>> can't think out a answer for the above question.
>>>>
>>>
>>> This is true, but how likely such plans are? I mean, surely no one would
>>> do nested loop with sequential scans on reasonably large tables, so how
>>> representative this example is?
>> 
>> Acutally this is a simplest Join case, we still have same problem like
>> Nested Loop + Index Scan which will be pretty common. 
>> 
>
> Yes, I understand there are cases where LRU eviction may not be the best
> choice - like here, where the "t1" should stay in the case. But there
> are also cases where this is the wrong choice, and LRU would be better.
>
> For example a couple paragraphs down you suggest to enforce the memory
> limit by disabling detoasting if the memory limit is reached. That means
> the detoasting can get disabled because there's a single access to the
> attribute somewhere "up the plan tree". But what if the other attributes
> (which now won't be detoasted) are accessed many times until then?

I am not sure I can't follow up here, but I want to explain more about
the disable-detoast-sharing logic when the memory limit is hit. When
this happen, the detoast sharing is disabled, but since the detoast
datum will be released every soon when the slot->tts_values[*] is
discard, then the 'disable' turn to 'enable' quickly. So It is not 
true that once it is get disabled, it can't be enabled any more for the
given query.

> I think LRU is a pretty good "default" algorithm if we don't have a very
> good idea of the exact life span of the values, etc. Which is why other
> nodes (e.g. Memoize) use LRU too.

> But I wonder if there's a way to count how many times an attribute is
> accessed (or is likely to be). That might be used to inform a better
> eviction strategy.

Yes, but in current issue we can get a better esitimation with the help
of plan shape and Memoize depends on some planner information as well.
If we bypass the planner information and try to resolve it at the 
cache level, the code may become to complex as well and all the cost is
run time overhead while the other way is a planning timie overhead.

> Also, we don't need to evict the whole entry - we could evict just the
> data part (guaranteed to be fairly large), but keep the header, and keep
> the counts, expected number of hits, and other info. And use this to
> e.g. release entries that reached the expected number of hits. But I'd
> probably start with LRU and only do this as an improvement later.

A great lession learnt here, thanks for sharing this!

As for the current user case what I want to highlight is in the current
user case, we are "caching" "user data" "locally".

USER DATA indicates it might be very huge, it is not common to have a
1M tables, but it is much common we have 1M Tuples in one scan, so
keeping the header might extra memory usage as well, like 10M * 24 bytes
= 240MB. LOCALLY means it is not friend to multi active sessions. CACHE
indicates it is hard to evict correctly. My method also have the USER
DATA, LOCALLY attributes, but it would be better at eviction. eviction
then have lead to memory usage issue which is discribed at the beginning
of this writing. 

>>> Also, this leads me to the need of having some sort of memory limit. If
>>> we may be keeping entries for extended periods of time, and we don't
>>> have any way to limit the amount of memory, that does not seem great.
>>>
>>> AFAIK if we detoast everything into tts_values[] there's no way to
>>> implement and enforce such limit. What happens if there's a row with
>>> multiple large-ish TOAST values? What happens if those rows are in
>>> different (and distant) parts of the plan?
>> 
>> I think this can be done by tracking the memory usage on EState level
>> or global variable level and disable it if it exceeds the limits and
>> resume it when we free the detoast datum when we don't need it. I think
>> no other changes need to be done.
>> 
>
> That seems like a fair amount of additional complexity. And what if the
> toasted values are accessed in context without EState (I haven't checked
> how common / important that is)?
>
> And I'm not sure just disabling the detoast as a way to enforce a memory
> limit, as explained earlier.
>
>>> It seems far easier to limit the memory with the toast cache.
>> 
>> I think the memory limit and entry eviction is the key issue now. IMO,
>> there are still some difference when both methods can support memory
>> limit. The reason is my patch can grantee the cached memory will be
>> reused, so if we set the limit to 10MB, we know all the 10MB is
>> useful, but the TOAST cache method, probably can't grantee that, then
>> when we want to make it effecitvely, we have to set a higher limit for
>> this.
>> 
>
> Can it actually guarantee that? It can guarantee the slot may be used,
> but I don't see how could it guarantee the detoasted value will be used.
> We may be keeping the slot for other reasons. And even if it could
> guarantee the detoasted value will be used, does that actually prove
> it's better to keep that value? What if it's only used once, but it's
> blocking detoasting of values that will be used 10x that?
>
> If we detoast a 10MB value in the outer side of the Nest Loop, what if
> the inner path has multiple accesses to another 10MB value that now
> can't be detoasted (as a shared value)?

Grarantee may be wrong word. The difference in my mind are:
1. plan shape have better potential to know the user case of datum,
since we know the plan tree and knows the rows pass to a given node.
2. Planning time effort is cheaper than run-time effort.
3. eviction in my method is not as important as it is in TOAST cache
method since it is reset per slot, so usually it doesn't hit limit in
fact. But as a cache, it does.  
4. use up to memory limit we set in TOAST cache case. 

>>> In any case, my concern is more about having to do this when creating
>>> the plan at all, the code complexity etc. Not just because it might have
>>> performance impact.
>> 
>> I think the main trade-off is TOAST cache method is pretty non-invasive
>> but can't control the eviction well, the impacts includes:
>> 1. may evicting the datum we want and kept the datum we don't need.
>
> This applies to any eviction algorithm, not just LRU. Ultimately what
> matters is whether we have in the cache the most often used values, i.e.
> the hit ratio (perhaps in combination with how expensive detoasting that
> particular entry was).

Correct, just that I am doubtful about design a LOCAL CACHE for USER
DATA with the reason I described above.

At last, thanks for your attention, really appreciated about it!

-- 
Best Regards
Andy Fan

Re: Shared detoast Datum proposal

From

Nikita Malakhov

Date:

05 March 2024, 09:03:35

Hi,

Tomas, sorry for confusion, in my previous message I meant exactly

the same approach you've posted above, and came up with almost

the same implementation.

Thank you very much for your attention to this thread!

I've asked Andy about this approach because of the same reasons you

mentioned - it keeps cache code small, localized and easy to maintain.

The question that worries me is the memory limit, and I think that cache

lookup and entry invalidation should be done in toast_tuple_externalize

code, for the case the value has been detoasted previously is updated

in the same query, like

UPDATE test SET value = value || '...';

I've added cache entry invalidation and data removal on delete and update

of the toasted values and am currently experimenting with large values.

On Tue, Mar 5, 2024 at 5:59 AM Andy Fan <zhihuifan1213@163.com> wrote:

>
>> 2. more likely to use up all the memory which is allowed. for example:
>> if we set the limit to 1MB, then we kept more data which will be not
>> used and then consuming all of the 1MB.
>>
>> My method is resolving this with some helps from other modules (kind of
>> invasive) but can control the eviction well and use the memory as less
>> as possible.
>>
>
> Is the memory usage really an issue? Sure, it'd be nice to evict entries
> as soon as we can prove they are not needed anymore, but if the cache
> limit is set to 1MB it's not really a problem to use 1MB.

This might be a key point which leads us to some different directions, so
I want to explain more about this, to see if we can get some agreement
here.

It is a bit hard to decide which memory limit to set, 1MB, 10MB or 40MB,
100MB. In my current case it is 40MB at least. Less memory limit
causes cache ineffecitve, high memory limit cause potential memory
use-up issue in the TOAST cache design. But in my method, even we set a
higher value, it just limits the user case it really (nearly) needed,
and it would not cache more values util the limit is hit. This would
make a noticable difference when we want to set a high limit and we have
some high active sessions, like 100 * 40MB = 4GB.

> On 3/4/24 18:08, Andy Fan wrote:
>> ...
>>>
>>>> I assumed that releasing all of the memory at the end of executor once
>>>> is not an option since it may consumed too many memory. Then, when and
>>>> which entry to release becomes a trouble for me. For example:
>>>>
>>>> QUERY PLAN
>>>> ------------------------------
>>>> Nested Loop
>>>> Join Filter: (t1.a = t2.a)
>>>> -> Seq Scan on t1
>>>> -> Seq Scan on t2
>>>> (4 rows)
>>>>
>>>> In this case t1.a needs a longer lifespan than t2.a since it is
>>>> in outer relation. Without the help from slot's life-cycle system, I
>>>> can't think out a answer for the above question.
>>>>
>>>
>>> This is true, but how likely such plans are? I mean, surely no one would
>>> do nested loop with sequential scans on reasonably large tables, so how
>>> representative this example is?
>>
>> Acutally this is a simplest Join case, we still have same problem like
>> Nested Loop + Index Scan which will be pretty common.
>>
>
> Yes, I understand there are cases where LRU eviction may not be the best
> choice - like here, where the "t1" should stay in the case. But there
> are also cases where this is the wrong choice, and LRU would be better.
>
> For example a couple paragraphs down you suggest to enforce the memory
> limit by disabling detoasting if the memory limit is reached. That means
> the detoasting can get disabled because there's a single access to the
> attribute somewhere "up the plan tree". But what if the other attributes
> (which now won't be detoasted) are accessed many times until then?

I am not sure I can't follow up here, but I want to explain more about
the disable-detoast-sharing logic when the memory limit is hit. When
this happen, the detoast sharing is disabled, but since the detoast
datum will be released every soon when the slot->tts_values[*] is
discard, then the 'disable' turn to 'enable' quickly. So It is not
true that once it is get disabled, it can't be enabled any more for the
given query.

> I think LRU is a pretty good "default" algorithm if we don't have a very
> good idea of the exact life span of the values, etc. Which is why other
> nodes (e.g. Memoize) use LRU too.

> But I wonder if there's a way to count how many times an attribute is
> accessed (or is likely to be). That might be used to inform a better
> eviction strategy.

Yes, but in current issue we can get a better esitimation with the help
of plan shape and Memoize depends on some planner information as well.
If we bypass the planner information and try to resolve it at the
cache level, the code may become to complex as well and all the cost is
run time overhead while the other way is a planning timie overhead.

> Also, we don't need to evict the whole entry - we could evict just the
> data part (guaranteed to be fairly large), but keep the header, and keep
> the counts, expected number of hits, and other info. And use this to
> e.g. release entries that reached the expected number of hits. But I'd
> probably start with LRU and only do this as an improvement later.

A great lession learnt here, thanks for sharing this!

As for the current user case what I want to highlight is in the current
user case, we are "caching" "user data" "locally".

USER DATA indicates it might be very huge, it is not common to have a
1M tables, but it is much common we have 1M Tuples in one scan, so
keeping the header might extra memory usage as well, like 10M * 24 bytes
= 240MB. LOCALLY means it is not friend to multi active sessions. CACHE
indicates it is hard to evict correctly. My method also have the USER
DATA, LOCALLY attributes, but it would be better at eviction. eviction
then have lead to memory usage issue which is discribed at the beginning
of this writing.

>>> Also, this leads me to the need of having some sort of memory limit. If
>>> we may be keeping entries for extended periods of time, and we don't
>>> have any way to limit the amount of memory, that does not seem great.
>>>
>>> AFAIK if we detoast everything into tts_values[] there's no way to
>>> implement and enforce such limit. What happens if there's a row with
>>> multiple large-ish TOAST values? What happens if those rows are in
>>> different (and distant) parts of the plan?
>>
>> I think this can be done by tracking the memory usage on EState level
>> or global variable level and disable it if it exceeds the limits and
>> resume it when we free the detoast datum when we don't need it. I think
>> no other changes need to be done.
>>
>
> That seems like a fair amount of additional complexity. And what if the
> toasted values are accessed in context without EState (I haven't checked
> how common / important that is)?
>
> And I'm not sure just disabling the detoast as a way to enforce a memory
> limit, as explained earlier.
>
>>> It seems far easier to limit the memory with the toast cache.
>>
>> I think the memory limit and entry eviction is the key issue now. IMO,
>> there are still some difference when both methods can support memory
>> limit. The reason is my patch can grantee the cached memory will be
>> reused, so if we set the limit to 10MB, we know all the 10MB is
>> useful, but the TOAST cache method, probably can't grantee that, then
>> when we want to make it effecitvely, we have to set a higher limit for
>> this.
>>
>
> Can it actually guarantee that? It can guarantee the slot may be used,
> but I don't see how could it guarantee the detoasted value will be used.
> We may be keeping the slot for other reasons. And even if it could
> guarantee the detoasted value will be used, does that actually prove
> it's better to keep that value? What if it's only used once, but it's
> blocking detoasting of values that will be used 10x that?
>
> If we detoast a 10MB value in the outer side of the Nest Loop, what if
> the inner path has multiple accesses to another 10MB value that now
> can't be detoasted (as a shared value)?

Grarantee may be wrong word. The difference in my mind are:
1. plan shape have better potential to know the user case of datum,
since we know the plan tree and knows the rows pass to a given node.
2. Planning time effort is cheaper than run-time effort.
3. eviction in my method is not as important as it is in TOAST cache
method since it is reset per slot, so usually it doesn't hit limit in
fact. But as a cache, it does.
4. use up to memory limit we set in TOAST cache case.

>>> In any case, my concern is more about having to do this when creating
>>> the plan at all, the code complexity etc. Not just because it might have
>>> performance impact.
>>
>> I think the main trade-off is TOAST cache method is pretty non-invasive
>> but can't control the eviction well, the impacts includes:
>> 1. may evicting the datum we want and kept the datum we don't need.
>
> This applies to any eviction algorithm, not just LRU. Ultimately what
> matters is whether we have in the cache the most often used values, i.e.
> the hit ratio (perhaps in combination with how expensive detoasting that
> particular entry was).

Correct, just that I am doubtful about design a LOCAL CACHE for USER
DATA with the reason I described above.

At last, thanks for your attention, really appreciated about it!

--
Best Regards
Andy Fan

Regards,

Nikita Malakhov

Postgres Professional

The Russian Postgres Company

https://postgrespro.ru/

Re: Shared detoast Datum proposal

From

Nikita Malakhov

Date:

06 March 2024, 06:09:33

Hi,

In addition to the previous message - for the toast_fetch_datum_slice

the first (seems obvious) way is to detoast the whole value, save it

to cache and get slices from it on demand. I have another one on my

mind, but have to play with it first.

Regards,

Nikita Malakhov

Postgres Professional

The Russian Postgres Company

https://postgrespro.ru/

Re: Shared detoast Datum proposal

From

Tomas Vondra

Date:

06 March 2024, 20:42:38

On 3/6/24 07:09, Nikita Malakhov wrote:
> Hi,
> 
> In addition to the previous message - for the toast_fetch_datum_slice
> the first (seems obvious) way is to detoast the whole value, save it
> to cache and get slices from it on demand. I have another one on my
> mind, but have to play with it first.
> 

What if you only need the first slice? In that case decoding everything
will be a clear regression.

IMHO this needs to work like this:

1) check if the cache already has sufficiently large slice detoasted
(and use the cached value if yes)

2) if there's no slice detoasted, or if it's too small, detoast a
sufficiently large slice and store it in the cache (possibly evicting
the already detoasted slice)


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Shared detoast Datum proposal

From

Tomas Vondra

Date:

06 March 2024, 20:58:48

On 3/5/24 10:03, Nikita Malakhov wrote:
> Hi,
> 
> Tomas, sorry for confusion, in my previous message I meant exactly
> the same approach you've posted above, and came up with almost
> the same implementation.
> 
> Thank you very much for your attention to this thread!
> 
> I've asked Andy about this approach because of the same reasons you
> mentioned - it keeps cache code small, localized and easy to maintain.
> 
> The question that worries me is the memory limit, and I think that cache
> lookup and entry invalidation should be done in toast_tuple_externalize
> code, for the case the value has been detoasted previously is updated
> in the same query, like
> 

I haven't thought very much about when we need to invalidate entries and
where exactly should that happen. My patch is a very rough PoC that I
wrote in ~1h, and it's quite possible it will have to move or should be
refactored in some way.

But I don't quite understand the problem with this query:

> UPDATE test SET value = value || '...';

Surely the update creates a new TOAST value, with a completely new TOAST
pointer, new rows in the TOAST table etc. It's not updated in-place. So
it'd end up with two independent entries in the TOAST cache.

Or are you interested just in evicting the old value simply to free the
memory, because we're not going to need that (now invisible) value? That
seems interesting, but if it's too invasive or complex I'd just leave it
up to the LRU eviction (at least for v1).

I'm not sure why should anything happen in toast_tuple_externalize, that
seems like a very low-level piece of code. But I realized there's also
toast_delete_external, and maybe that's a good place to invalidate the
deleted TOAST values (which would also cover the UPDATE).

> I've added cache entry invalidation and data removal on delete and update
> of the toasted values and am currently experimenting with large values.
> 

I haven't done anything about large values in the PoC patch, but I think
it might be a good idea to either ignore values that are too large (so
that it does not push everything else from the cache) or store them in
compressed form (assuming it's small enough, to at least save the I/O).

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Shared detoast Datum proposal

From

Nikita Malakhov

Date:

07 March 2024, 07:33:18

Hi!

Tomas, I thought about this issue -

>What if you only need the first slice? In that case decoding everything

>will be a clear regression.

And completely agree with you, I'm currently working on it and will post

a patch a bit later. Another issue we have here - if we have multiple

slices requested - then we have to update cached slice from both sides,

the beginning and the end.

On update, yep, you're right

>Surely the update creates a new TOAST value, with a completely new TOAST
>pointer, new rows in the TOAST table etc. It's not updated in-place. So
>it'd end up with two independent entries in the TOAST cache.

>Or are you interested just in evicting the old value simply to free the
>memory, because we're not going to need that (now invisible) value? That
>seems interesting, but if it's too invasive or complex I'd just leave it
>up to the LRU eviction (at least for v1).

Again, yes, we do not need the old value after it was updated and

it is better to take care of it explicitly. It's a simple and not invasive

addition to your patch.

Could not agree with you about large values - it makes sense

to cache large values because the larger it is the more benefit

we will have from not detoasting it again and again. One way

is to keep them compressed, but what if we have data with very

low compression rate?

I also changed HASHACTION field value from HASH_ENTER to

HASH_ENTER_NULL to softly deal with case when we do not

have enough memory and added checks for if the value was stored

or not.

Regards,

Nikita Malakhov

Postgres Professional

The Russian Postgres Company

https://postgrespro.ru/

Re: Shared detoast Datum proposal

From

Tomas Vondra

Date:

07 March 2024, 11:44:41

On 3/7/24 08:33, Nikita Malakhov wrote:
> Hi!
> 
> Tomas, I thought about this issue -
>> What if you only need the first slice? In that case decoding everything
>> will be a clear regression.
> And completely agree with you, I'm currently working on it and will post
> a patch a bit later.

Cool. I don't plan to work on improving my patch - it was merely a PoC,
so if you're interested in working on that, that's good.

> Another issue we have here - if we have multiple slices requested -
> then we have to update cached slice from both sides, the beginning
> and the end.
> 

No opinion. I haven't thought about how to handle slices too much.

> On update, yep, you're right
>> Surely the update creates a new TOAST value, with a completely new TOAST
>> pointer, new rows in the TOAST table etc. It's not updated in-place. So
>> it'd end up with two independent entries in the TOAST cache.
> 
>> Or are you interested just in evicting the old value simply to free the
>> memory, because we're not going to need that (now invisible) value? That
>> seems interesting, but if it's too invasive or complex I'd just leave it
>> up to the LRU eviction (at least for v1).
> Again, yes, we do not need the old value after it was updated and
> it is better to take care of it explicitly. It's a simple and not invasive
> addition to your patch.
> 

OK

> Could not agree with you about large values - it makes sense
> to cache large values because the larger it is the more benefit
> we will have from not detoasting it again and again. One way
> is to keep them compressed, but what if we have data with very
> low compression rate?
> 

I'm not against caching large values, but I also think it's not
difficult to construct cases where it's a losing strategy.

> I also changed HASHACTION field value from HASH_ENTER to
> HASH_ENTER_NULL to softly deal with case when we do not
> have enough memory and added checks for if the value was stored
> or not.
> 

I'm not sure about this. HASH_ENTER_NULL is only being used in three
very specific places (two are lock management, one is shmeminitstruct).
This hash table is not unique in any way, so why not to use HASH_ENTER
like 99% other hash tables?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Shared detoast Datum proposal

From

Nikita Malakhov

Date:

15 March 2024, 08:21:41

Hi!

Here's a slightly improved version of patch Tomas provided above (v2),

with cache invalidations and slices caching added, still as PoC.

The main issue I've encountered during tests is that when the same query

retrieves both slices and full value - slices, like substring(t,...) the order of

the values is not predictable, with text fields substrings are retrieved

before the full value and we cannot benefit from cache full value first

and use slices from cached value.

Yet the cache code is still very compact and affects only sources related

to detoasting.

Tomas, about HASH_ENTER - according to comments it could throw

an OOM error, so I've changed it to HASH_ENTER_NULL to avoid

new errors. In this case we would just have the value not cached

without an error.

Regards,

Nikita Malakhov

Postgres Professional

The Russian Postgres Company

https://postgrespro.ru/

Attachment

0001-toast-cache-v3.patch

Re: Shared detoast Datum proposal

From

Robert Haas

Date:

20 May 2024, 15:12:34

Hi,

Andy Fan asked me off-list for some feedback about this proposal. I
have hesitated to comment on it for lack of having studied the matter
in any detail, but since I've been asked for my input, here goes:

I agree that there's a problem here, but I suspect this is not the
right way to solve it. Let's compare this to something like the
syscache. Specifically, let's think about the syscache on
pg_class.relname.

In the case of the syscache on pg_class.relname, there are two reasons
why we might repeatedly look up the same values in the cache. One is
that the code might do multiple name lookups when it really ought to
do only one. Doing multiple lookups is bad for security and bad for
performance, but we have code like that in some places and some of it
is hard to fix. The other reason is that it is likely that the user
will mention the same table names repeatedly; each time they do, they
will trigger a lookup on the same name. By caching the result of the
lookup, we can make it much faster. An important point to note here is
that the queries that the user will issue in the future are unknown to
us. In a certain sense, we cannot even know whether the same table
name will be mentioned more than once in the same query: when we reach
the first instance of the table name and look it up, we do not have
any good way of knowing whether it will be mentioned again later, say
due to a self-join. Hence, the pattern of cache lookups is
fundamentally unknowable.

But that's not the case in the motivating example that started this
thread. In that case, the target list includes three separate
references to the same column. We can therefore predict that if the
column value is toasted, we're going to de-TOAST it exactly three
times. If, on the other hand, the column value were mentioned only
once, it would be detoasted just once. In that latter case, which is
probably quite common, this whole cache idea seems like it ought to
just waste cycles, which makes me suspect that no matter what is done
to this patch, there will always be cases where it causes significant
regressions. In the former case, where the column reference is
repeated, it will win, but it will also hold onto the detoasted value
after the third instance of detoasting, even though there will never
be any more cache hits. Here, unlike the syscache case, the cache is
thrown away at the end of the query, so future queries can't benefit.
Even if we could find a way to preserve the cache in some cases, it's
not very likely to pay off. People are much more likely to hit the
same table more than once than to hit the same values in the same
table more than once in the same session.

Suppose we had a design where, when a value got detoasted, the
detoasted value went into the place where the toasted value had come
from. Then this problem would be handled pretty much perfectly: the
detoasted value would last until the scan advanced to the next row,
and then it would be thrown out. So there would be no query-lifespan
memory leak. Now, I don't really know how this would work, because the
TOAST pointer is coming out of a slot and we can't necessarily write
one attribute of any arbitrary slot type without a bunch of extra
overhead, but maybe there's some way. If it could be done cheaply
enough, it would gain all the benefits of this proposal a lot more
cheaply and with fewer downsides. Alternatively, maybe there's a way
to notice the multiple references and introduce some kind of
intermediate slot or other holding area where the detoasted value can
be stored.

In other words, instead of computing

big_jsonb_col->'a',  big_jsonb_col->'b',  big_jsonb_col->'c'

Maybe we ought to be trying to compute this:

big_jsonb_col=detoast(big_jsonb_col)
big_jsonb_col->'a',  big_jsonb_col->'b',  big_jsonb_col->'c'

Or perhaps this:

tmp=detoast(big_jsonb_col)
tmp->'a', tmp->'b', tmp->'c'

In still other words, a query-lifespan cache for this information
feels like it's tackling the problem at the wrong level. I do suspect
there may be query shapes where the same value gets detoasted multiple
times in ways that the proposals I just made wouldn't necessarily
catch, such as in the case of a self-join. But I think those cases are
rare. In most cases, repeated detoasting is probably very localized,
with the same exact TOAST pointer being de-TOASTed a couple of times
in quick succession, so a query-lifespan cache seems like the wrong
thing. Also, in most cases, repeated detoasting is probably quite
predictable: we could infer it from static analysis of the query. So
throwing ANY kind of cache at the problem seems like the wrong
approach unless the cache is super-cheap, which doesn't seem to be the
case with this proposal, because a cache caters to cases where we
can't know whether we're going to recompute the same result, and here,
that seems like something we could probably figure out.

Because I don't see any way to avoid regressions in the common case
where detoasting isn't repeated, I think this patch is likely never
going to be committed no matter how much time anyone spends on it.

But, to repeat the disclaimer from the top, all of the above is just a
relatively uneducated opinion without any detailed study of the
matter.

...Robert

Re: Shared detoast Datum proposal

From

Andy Fan

Date:

22 May 2024, 01:02:55

Hi Robert, 

> Andy Fan asked me off-list for some feedback about this proposal. I
> have hesitated to comment on it for lack of having studied the matter
> in any detail, but since I've been asked for my input, here goes:

Thanks for doing this! Since we have two totally different designs
between Tomas and me and I think we both understand the two sides of
each design, just that the following things are really puzzled to me.

- which factors we should care more.
- the possiblility to improve the design-A/B for its badness.

But for the point 2, we probably needs some prefer design first.

for now let's highlight that there are 2 totally different method to
handle this problem. Let's call:

design a:  detoast the datum into slot->tts_values (and manage the
memory with the lifespan of *slot*). From me.

design b:  detost the datum into a CACHE (and manage the memory with
CACHE eviction and at released the end-of-query). From Tomas.

Currently I pefer design a, I'm not sure if I want desgin a because a is
really good or just because design a is my own design. So I will
summarize the the current state for the easier discussion. I summarize
them so that you don't need to go to the long discussion, and I summarize
the each point with the link to the original post since I may
misunderstand some points and the original post can be used as a double
check. 

> I agree that there's a problem here, but I suspect this is not the
> right way to solve it. Let's compare this to something like the
> syscache. Specifically, let's think about the syscache on
> pg_class.relname.

OK.

> In the case of the syscache on pg_class.relname, there are two reasons
> why we might repeatedly look up the same values in the cache. One is
> that the code might do multiple name lookups when it really ought to
> do only one. Doing multiple lookups is bad for security and bad for
> performance, but we have code like that in some places and some of it
> is hard to fix. The other reason is that it is likely that the user
> will mention the same table names repeatedly; each time they do, they
> will trigger a lookup on the same name. By caching the result of the
> lookup, we can make it much faster. An important point to note here is
> that the queries that the user will issue in the future are unknown to
> us. In a certain sense, we cannot even know whether the same table
> name will be mentioned more than once in the same query: when we reach
> the first instance of the table name and look it up, we do not have
> any good way of knowing whether it will be mentioned again later, say
> due to a self-join. Hence, the pattern of cache lookups is
> fundamentally unknowable.

True. 

> But that's not the case in the motivating example that started this
> thread. In that case, the target list includes three separate
> references to the same column. We can therefore predict that if the
> column value is toasted, we're going to de-TOAST it exactly three
> times.

True.

> If, on the other hand, the column value were mentioned only
> once, it would be detoasted just once. In that latter case, which is
> probably quite common, this whole cache idea seems like it ought to
> just waste cycles, which makes me suspect that no matter what is done
> to this patch, there will always be cases where it causes significant
> regressions.

I agree.

> In the former case, where the column reference is
> repeated, it will win, but it will also hold onto the detoasted value
> after the third instance of detoasting, even though there will never
> be any more cache hits.

True.

> Here, unlike the syscache case, the cache is
> thrown away at the end of the query, so future queries can't benefit.
> Even if we could find a way to preserve the cache in some cases, it's
> not very likely to pay off. People are much more likely to hit the
> same table more than once than to hit the same values in the same
> table more than once in the same session.

I agree.

One more things I want to highlight it "syscache" is used for metadata
and *detoast cache* is used for user data. user data is more 
likely bigger than metadata, so cache size control (hence eviction logic
run more often) is more necessary in this case. The first graph in [1]
(including the quota text) highlight this idea.

> Suppose we had a design where, when a value got detoasted, the
> detoasted value went into the place where the toasted value had come
> from. Then this problem would be handled pretty much perfectly: the
> detoasted value would last until the scan advanced to the next row,
> and then it would be thrown out. So there would be no query-lifespan
> memory leak.

I think that is same idea as design a.

> Now, I don't really know how this would work, because the
> TOAST pointer is coming out of a slot and we can't necessarily write
> one attribute of any arbitrary slot type without a bunch of extra
> overhead, but maybe there's some way.

I think the key points includes:

1. Where to put the *detoast value* so that ExprEval system can find out
it cheaply. The soluation in A slot->tts_values[*].

2. How to manage the memory for holding the detoast value.

The soluation in A is:

- using a lazy created MemoryContext in slot.
- let the detoast system detoast the datum into the designed
MemoryContext.

3. How to let Expr evalution run the extra cycles when needed.

The soluation in A is:

EEOP_XXX_TOAST step is designed to get riding of some *run-time* *if
(condition)" . I found the existing ExprEval system have many
specialized steps.

I will post a detailed overall design later and explain the its known
badness so far.

> If it could be done cheaply enough, it would gain all the benefits of
> this proposal a lot more cheaply and with fewer
> downsides.

I agree. 

> Alternatively, maybe there's a way 
> to notice the multiple references and introduce some kind of
> intermediate slot or other holding area where the detoasted value can
> be stored.
>
> In other words, instead of computing
>
> big_jsonb_col->'a',  big_jsonb_col->'b',  big_jsonb_col->'c'
>
> Maybe we ought to be trying to compute this:
>
> big_jsonb_col=detoast(big_jsonb_col)
> big_jsonb_col->'a',  big_jsonb_col->'b',  big_jsonb_col->'c'
>
> Or perhaps this:
>
> tmp=detoast(big_jsonb_col)
> tmp->'a', tmp->'b', tmp->'c'

Current the design a doesn't use 'tmp' so that the existing ExprEval
engine can find out "detoast datum" easily, so it is more like:

> big_jsonb_col=detoast(big_jsonb_col)
> big_jsonb_col->'a',  big_jsonb_col->'b',  big_jsonb_col->'c'


[1] https://www.postgresql.org/message-id/87ttllbd00.fsf%40163.com

-- 
Best Regards
Andy Fan

Re: Shared detoast Datum proposal

From

Andy Fan

Date:

22 May 2024, 02:47:43

Andy Fan <zhihuifan1213@163.com> writes:

> Hi Robert, 
>
>> Andy Fan asked me off-list for some feedback about this proposal. I
>> have hesitated to comment on it for lack of having studied the matter
>> in any detail, but since I've been asked for my input, here goes:
>
> Thanks for doing this! Since we have two totally different designs
> between Tomas and me and I think we both understand the two sides of
> each design, just that the following things are really puzzled to me.

my emacs was stuck and somehow this incompleted message was sent out,
Please ignore this post for now. Sorry for this.

-- 
Best Regards
Andy Fan

Re: Shared detoast Datum proposal

From

Nikita Malakhov

Date:

22 May 2024, 12:46:21

Hi,

Robert, thank you very much for your response to this thread.

I agree with most of things you've mentioned, but this proposal

is a PoC, and anyway has a long way to go to be presented

(if it ever would) as something to be committed.

Andy, glad you've not lost interest in this work, I'm looking

forward to your improvements!

Regards,

Nikita Malakhov

Postgres Professional

The Russian Postgres Company

https://postgrespro.ru/

Re: Shared detoast Datum proposal

From

Robert Haas

Date:

22 May 2024, 15:16:01

On Tue, May 21, 2024 at 10:02 PM Andy Fan <zhihuifan1213@163.com> wrote:
> One more things I want to highlight it "syscache" is used for metadata
> and *detoast cache* is used for user data. user data is more
> likely bigger than metadata, so cache size control (hence eviction logic
> run more often) is more necessary in this case. The first graph in [1]
> (including the quota text) highlight this idea.

Yes.

> I think that is same idea as design a.

Sounds similar.

> I think the key points includes:
>
> 1. Where to put the *detoast value* so that ExprEval system can find out
> it cheaply. The soluation in A slot->tts_values[*].
>
> 2. How to manage the memory for holding the detoast value.
>
> The soluation in A is:
>
> - using a lazy created MemoryContext in slot.
> - let the detoast system detoast the datum into the designed
> MemoryContext.
>
> 3. How to let Expr evalution run the extra cycles when needed.

I don't understand (3) but I agree that (1) and (2) are key points. In
many - nearly all - circumstances just overwriting slot->tts_values is
unsafe and will cause problems. To make this work, we've got to find
some way of doing this that is compatible with general design of
TupleTableSlot, and I think in that API, just about the only time you
can touch slot->tts_values is when you're trying to populate a virtual
slot. But often we'll have some other kind of slot. And even if we do
have a virtual slot, I think you're only allowed to manipulate
slot->tts_values up until you call ExecStoreVirtualTuple.

That's why I mentioned using something temporary. You could legally
(a) materialize the existing slot, (b) create a new virtual slot, (c)
copy over the tts_values and tts_isnull arrays, (c) detoast any datums
you like from tts_values and store the new datums back into the array,
(d) call ExecStoreVirtualTuple, and then (e) use the new slot in place
of the original one. That would avoid repeated detoasting, but it
would also have a bunch of overhead. Having to fully materialize the
existing slot is a cost that you wouldn't always need to pay if you
didn't do this, but you have to do it to make this work. Creating the
new virtual slot costs something. Copying the arrays costs something.
Detoasting costs quite a lot potentially, if you don't end up using
the values. If you end up needing a heap tuple representation of the
slot, you're going to now have to make a new one rather than just
using the one that came straight from the buffer.

So I don't think we could do something like this unconditionally.
We'd have to do it only when there was a reasonable expectation that
it would work out, which means we'd have to be able to predict fairly
accurately whether it was going to work out. But I do agree with you
that *if* we could make it work, it would probably be better than a
longer-lived cache.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: Shared detoast Datum proposal

From

Andy Fan

Date:

23 May 2024, 00:27:41

Robert Haas <robertmhaas@gmail.com> writes:

> On Tue, May 21, 2024 at 10:02 PM Andy Fan <zhihuifan1213@163.com> wrote:
>> One more things I want to highlight it "syscache" is used for metadata
>> and *detoast cache* is used for user data. user data is more
>> likely bigger than metadata, so cache size control (hence eviction logic
>> run more often) is more necessary in this case. The first graph in [1]
>> (including the quota text) highlight this idea.
>
> Yes.
>
>> I think that is same idea as design a.
>
> Sounds similar.
>

This is really great to know!

>> I think the key points includes:
>>
>> 1. Where to put the *detoast value* so that ExprEval system can find out
>> it cheaply. The soluation in A slot->tts_values[*].
>>
>> 2. How to manage the memory for holding the detoast value.
>>
>> The soluation in A is:
>>
>> - using a lazy created MemoryContext in slot.
>> - let the detoast system detoast the datum into the designed
>> MemoryContext.
>>
>> 3. How to let Expr evalution run the extra cycles when needed.
>
> I don't understand (3) but I agree that (1) and (2) are key points. In
> many - nearly all - circumstances just overwriting slot->tts_values is
> unsafe and will cause problems. To make this work, we've got to find
> some way of doing this that is compatible with general design of
> TupleTableSlot, and I think in that API, just about the only time you
> can touch slot->tts_values is when you're trying to populate a virtual
> slot. But often we'll have some other kind of slot. And even if we do
> have a virtual slot, I think you're only allowed to manipulate
> slot->tts_values up until you call ExecStoreVirtualTuple.

Please give me one more chance to explain this. What I mean is:

Take SELECT f(a) FROM t1 join t2...; for example,

When we read the Datum of a Var, we read it from tts->tts_values[*], no
matter what kind of TupleTableSlot is.  So if we can put the "detoast
datum" back to the "original " tts_values, then nothing need to be
changed.

Aside from efficiency considerations, security and rationality are also
important considerations. So what I think now is:

- tts_values[*] means the Datum from the slot, and it has its operations
  like slot_getsomeattrs, ExecMaterializeSlot, ExecCopySlot,
  ExecClearTuple and so on. All of the characters have nothing with what
  kind of slot it is.

- Saving the "detoast datum" version into tts_values[*] doesn't break
  the above semantic and the ExprEval engine just get a detoast version
  so it doesn't need to detoast it again.

- The keypoint is the memory management and effeiciency. for now I think
  all the places where "slot->tts_nvalid" is set to 0 means the
  tts_values[*] is no longer validate, then this is the place we should
  release the memory for the "detoast datum".  All the other places like
  ExecMaterializeSlot or ExecCopySlot also need to think about the
  "detoast datum" as well.

> That's why I mentioned using something temporary. You could legally
> (a) materialize the existing slot, (b) create a new virtual slot, (c)
> copy over the tts_values and tts_isnull arrays, (c) detoast any datums
> you like from tts_values and store the new datums back into the array,
> (d) call ExecStoreVirtualTuple, and then (e) use the new slot in place
> of the original one. That would avoid repeated detoasting, but it
> would also have a bunch of overhead.

Yes, I agree with the overhead, so I perfer to know if the my current
strategy has principle issue first.

> Having to fully materialize the
> existing slot is a cost that you wouldn't always need to pay if you
> didn't do this, but you have to do it to make this work. Creating the
> new virtual slot costs something. Copying the arrays costs something.
> Detoasting costs quite a lot potentially, if you don't end up using
> the values. If you end up needing a heap tuple representation of the
> slot, you're going to now have to make a new one rather than just
> using the one that came straight from the buffer.

> But I do agree with you
> that *if* we could make it work, it would probably be better than a
> longer-lived cache.

I noticed the "if" and great to know that:), speically for the efficiecy
& memory usage control purpose.

Another big challenge of this is how to decide if a Var need this
pre-detoasting strategy, we have to consider:

- The in-place tts_values[*] update makes the slot bigger, which is
  harmful for CP_SMALL_TLIST (createplan.c) purpose.
- ExprEval runtime checking if the Var is toastable is an overhead. It
  is must be decided ExecInitExpr time.

The code complexity because of the above 2 factors makes people (include
me) unhappy.

===
Yesterday I did some worst case testing for the current strategy, and
the first case show me ~1% *unexpected* performance regression and I
tried to find out it with perf record/report, and no lucky. that's the
reason why I didn't send a complete post yesterday.

As for now, since we are talking about the high level design, I think
it is OK to post the improved design document and the incompleted worst
case testing data and my planning.

Current Design
--------------

The high level design is let createplan.c and setref.c decide which
Vars can use this feature, and then the executor save the detoast
datum back slot->to tts_values[*] during the ExprEvalStep of
EEOP_{INNER|OUTER|SCAN}_VAR_TOAST. The reasons includes:

- The existing expression engine read datum from tts_values[*], no any
  extra work need to be done.
- Reuse the lifespan of TupleTableSlot system to manage memory. It
  is natural to think the detoast datum is a tts_value just that it is
  in a detoast format. Since we have a clear lifespan for TupleTableSlot
  already, like ExecClearTuple, ExecCopySlot et. We are easy to reuse
  them for managing the datoast datum's memory.
- The existing projection method can copy the datoasted datum (int64)
  automatically to the next node's slot, but keeping the ownership
  unchanged, so only the slot where the detoast really happen take the
  charge of it's lifespan.

Assuming which Var should use this feature has been decided in
createplan.c and setref.c already. The following strategy is used to
reduce the run time overhead.

1. Putting the detoast datum into tts->tts_values[*]. then the ExprEval
   system doesn't need any extra effort to find them.

static inline void
ExecSlotDetoastDatum(TupleTableSlot *slot, int attnum)
{
    if (!slot->tts_isnull[attnum] &&
        VARATT_IS_EXTENDED(slot->tts_values[attnum]))
    {
        if (unlikely(slot->tts_data_mctx == NULL))
            slot->tts_data_mctx = GenerationContextCreate(slot->tts_mcxt,
                                     "tts_value_ctx",
                                    ALLOCSET_START_SMALL_SIZES);
        slot->tts_values[attnum] = PointerGetDatum(detoast_attr_ext((struct varlena *) slot->tts_values[attnum],
                                                                 /* save the detoast value to the given MemoryContext.
*/
                                        slot->tts_data_mctx));
    }
}

2. Using a dedicated lazy created memory context under TupleTableSlot so
   that the memory can be released effectively.

static inline void
ExecFreePreDetoastDatum(TupleTableSlot *slot)
{
    if (slot->tts_data_mctx)
        MemoryContextResetOnly(slot->tts_data_mctx);
}

3. Control the memory usage by releasing them whenever the
   slot->tts_values[*] is deemed as invalidate, so the lifespan is
   likely short.

4. Reduce the run-time overhead for checking if a VAR should use
   pre-detoast feature, 3 new steps are introduced.
   EEOP_{INNER,OUTER,SCAN}_VAR_TOAST, so that other Var is totally non
   related.

Now comes to the createplan.c/setref.c part, which decides which Vars
should use the shared detoast feature. The guideline of this is:

1. It needs a detoast for a given expression in the previous logic.
2. It should not breaks the CP_SMALL_TLIST design. Since we saved the
   detoast datum back to tts_values[*], which make tuple bigger. if we
   do this blindly, it would be harmful to the ORDER / HASH style nodes.

A high level data flow is:

1. at the createplan.c, we walk the plan tree go gather the
   CP_SMALL_TLIST because of SORT/HASH style nodes, information and save
   it to Plan.forbid_pre_detoast_vars via the function
   set_plan_forbid_pre_detoast_vars_recurse.

2. at the setrefs.c, fix_{scan|join}_expr will recurse to Var for each
   expression, so it is a good time to track the attribute number and
   see if the Var is directly or indirectly accessed. Usually the
   indirectly access a Var means a detoast would happens, for
   example an expression like a > 3. However some known expressions is
   ignored. for example: NullTest, pg_column_compression which needs the
   raw datum, start_with/sub_string which needs a slice
   detoasting. Currently there is some hard code here, we may needs a
   pg_proc.detoasting_requirement flags to make this generic.  The
   output is {Scan|Join}.xxx_reference_attrs;

Note that here I used '_reference_' rather than '_detoast_' is because
at this part, I still don't know if it is a toastable attrbiute, which
is known at the MakeTupleTableSlot stage.

3. At the InitPlan Stage, we calculate the final xxx_pre_detoast_attrs
   in ScanState & JoinState, which will be passed into expression
   engine in the ExecInitExprRec stage and EEOP_{INNER|OUTER|SCAN}
   _VAR_TOAST steps are generated finally then everything is connected
   with ExecSlotDetoastDatum!

Bad case testing of current design:
====================================

- The extra effort of createplan.c & setref.c & execExpr.c & InitPlan :

CREATE TABLE t(a int, b numeric);

q1:
explain select * from t where a > 3;

q2:
set enable_hashjoin to off;
set enable_mergejoin to off;
set enable_nestloop to on;

explain select * from t join t t2 using(a);

q3:
set enable_hashjoin to off;
set enable_mergejoin to on;
set enable_nestloop to off;
explain select * from t join t t2 using(a);

q4:
set enable_hashjoin to on;
set enable_mergejoin to off;
set enable_nestloop to off;
explain select * from t join t t2 using(a);

pgbench -f x.sql -n postgres  -M simple -T10

| query ID | patched (ms)                        | master (ms)                         | regression(%) |
|----------+-------------------------------------+-------------------------------------+---------------|
|        1 | {0.088, 0.088, 0.088, 0.087, 0.089} | {0.087, 0.086, 0.086, 0.087, 0.087} |         1.14% |
|        2 | not-done-yet                        |                                     |               |
|        3 | not-done-yet                        |                                     |               |
|        4 | not-done-yet                        |                                     |               |

- The extra effort of ExecInterpExpr

insert into t select i, i FROM generate_series(1, 100000) i;

SELECT count(b) from t WHERE b > 0;

In this query, the extra execution code is run but it doesn't buy anything.

| master | patched | regression |
|--------+---------+------------|
|        |         |            |

What I am planning now is:
- Gather more feedback on the overall strategy.
- figure out the 1% unexpected regression. I don't have a clear
direction for now, my current expression is analyzing the perf report,
and I can't find out the root cause. and reading the code can't find out
the root cause as well.
- Simplifying the "planner part" for deciding which Var should use this
strategy. I don't have clear direction as well.
- testing and improving more worst case.

Attached is a appliable and testingable version against current master
(e87e73245).

--
Best Regards
Andy Fan

Attachment

v10-0001-shared-detoast-datum.patch

Re: Shared detoast Datum proposal

From

Andy Fan

Date:

23 May 2024, 01:46:15

Nikita Malakhov <hukutoc@gmail.com> writes:

> Hi,

> Andy, glad you've not lost interest in this work, I'm looking
> forward to your improvements!

Thanks for your words, I've adjusted to the rhythm of the community and
welcome more feedback:)

-- 
Best Regards
Andy Fan

Re: Shared detoast Datum proposal

From

Robert Haas

Date:

23 May 2024, 13:29:08

On Wed, May 22, 2024 at 9:46 PM Andy Fan <zhihuifan1213@163.com> wrote:
> Please give me one more chance to explain this. What I mean is:
>
> Take SELECT f(a) FROM t1 join t2...; for example,
>
> When we read the Datum of a Var, we read it from tts->tts_values[*], no
> matter what kind of TupleTableSlot is.  So if we can put the "detoast
> datum" back to the "original " tts_values, then nothing need to be
> changed.

Yeah, I don't think you can do that.

> - Saving the "detoast datum" version into tts_values[*] doesn't break
>   the above semantic and the ExprEval engine just get a detoast version
>   so it doesn't need to detoast it again.

I don't think this is true. If it is true, it needs to be extremely
well-justified. Even if this seems to work in simple cases, I suspect
there will be cases where it breaks badly, resulting in memory leaks
or server crashes or some other kind of horrible problem that I can't
quite imagine. Unfortunately, my knowledge of this code isn't
fantastic, so I can't say exactly what bad thing will happen, and I
can't even say with 100% certainty that anything bad will happen, but
I bet it will. It seems like it goes against everything I understand
about how TupleTableSlots are supposed to be used. (Andres would be
the best person to give a definitive answer here.)

> - The keypoint is the memory management and effeiciency. for now I think
>   all the places where "slot->tts_nvalid" is set to 0 means the
>   tts_values[*] is no longer validate, then this is the place we should
>   release the memory for the "detoast datum".  All the other places like
>   ExecMaterializeSlot or ExecCopySlot also need to think about the
>   "detoast datum" as well.

This might be a way of addressing some of the issues with this idea,
but I doubt it will be acceptable. I don't think we want to complicate
the slot API for this feature. There's too much downside to doing
that, in terms of performance and understandability.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: Shared detoast Datum proposal

From

Andy Fan

Date:

24 May 2024, 02:46:38

Robert Haas <robertmhaas@gmail.com> writes:

> On Wed, May 22, 2024 at 9:46 PM Andy Fan <zhihuifan1213@163.com> wrote:
>> Please give me one more chance to explain this. What I mean is:
>>
>> Take SELECT f(a) FROM t1 join t2...; for example,
>>
>> When we read the Datum of a Var, we read it from tts->tts_values[*], no
>> matter what kind of TupleTableSlot is.  So if we can put the "detoast
>> datum" back to the "original " tts_values, then nothing need to be
>> changed.
>
> Yeah, I don't think you can do that.
>
>> - Saving the "detoast datum" version into tts_values[*] doesn't break
>>   the above semantic and the ExprEval engine just get a detoast version
>>   so it doesn't need to detoast it again.
>
> I don't think this is true. If it is true, it needs to be extremely
> well-justified. Even if this seems to work in simple cases, I suspect
> there will be cases where it breaks badly, resulting in memory leaks
> or server crashes or some other kind of horrible problem that I can't
> quite imagine. Unfortunately, my knowledge of this code isn't
> fantastic, so I can't say exactly what bad thing will happen, and I
> can't even say with 100% certainty that anything bad will happen, but
> I bet it will. It seems like it goes against everything I understand
> about how TupleTableSlots are supposed to be used. (Andres would be
> the best person to give a definitive answer here.)
>
>> - The keypoint is the memory management and effeiciency. for now I think
>>   all the places where "slot->tts_nvalid" is set to 0 means the
>>   tts_values[*] is no longer validate, then this is the place we should
>>   release the memory for the "detoast datum".  All the other places like
>>   ExecMaterializeSlot or ExecCopySlot also need to think about the
>>   "detoast datum" as well.
>
> This might be a way of addressing some of the issues with this idea,
> but I doubt it will be acceptable. I don't think we want to complicate
> the slot API for this feature. There's too much downside to doing
> that, in terms of performance and understandability.

Thanks for the doubt! Let's see if Andres has time to look at this. I
feel great about the current state.

--
Best Regards
Andy Fan

Re: Shared detoast Datum proposal

From

Andy Fan

Date:

22 June 2024, 03:10:41

Hi,

Let's me clarify the current state of this patch and what kind of help
is needed.  You can check [1] for what kind of problem it try to solve
and what is the current approach.

The current blocker is if the approach is safe (potential to memory leak
crash etc.).  And I want to have some agreement on this step at least.

"""
When we access [some desired] toast column in EEOP_{SCAN|INNER_OUTER}_VAR
steps, we can detoast it immediately and save it back to
slot->tts_values. so the later ExprState can use the same detoast version
all the time.  this should be more the most effective and simplest way.
"""

IMO, it doesn't break any design of TupleTableSlot and should be
safe. Here is my understanding of TupleTableSlot, please correct me if
anything I missed.

(a). TupleTableSlot define tts_values[*] / tts_isnulls[*] at the high level.
  When the upper layer want a attnum, it just access tts_values/nulls[n].

(b). TupleTableSlot has many different variants, depends on how to get
  tts_values[*] from them, like VirtualTupleTableSlot, HeapTupleTableSlot,
  how to manage its resource (mainly memory) when the life cycle is
  done, for example BufferHeapTupleTableSlot.

(c). TupleTableSlot defines a lot operation against on that, like
getsomeattrs, copy, clear and so on, for the different variants.

the impacts of my proposal on the above factor:

(a). it doesn't break it clearly,
(b). it adds a 'detoast' step for step (b) when fill the tts_values[*],
(this is done conditionally, see [1] for the overall design, we
currently focus the safety). What's more, I only added the step in
EEOP_{SCAN|INNER_OUTER}_VAR_TOAST step, it reduce the impact in another
step.
(c). a special MemoryContext in TupleTableSlot is introduced for the
detoast datum, it is reset whenever the TupleTableSlot.tts_nvalid
= 0. and I also double check every operation defined in
TupleTableSlotOps, looks nothing should be specially handled.

Robert has expressed his concern as below and wanted Andres to have a
look at, but unluckly Andres doesn't have such time so far:(

> Robert Haas <robertmhaas@gmail.com> writes:
>
>> On Wed, May 22, 2024 at 9:46 PM Andy Fan <zhihuifan1213@163.com> wrote:
>>> Please give me one more chance to explain this. What I mean is:
>>>
>>> Take SELECT f(a) FROM t1 join t2...; for example,
>>>
>>> When we read the Datum of a Var, we read it from tts->tts_values[*], no
>>> matter what kind of TupleTableSlot is.  So if we can put the "detoast
>>> datum" back to the "original " tts_values, then nothing need to be
>>> changed.
>>
>> Yeah, I don't think you can do that.
>>
>>> - Saving the "detoast datum" version into tts_values[*] doesn't break
>>>   the above semantic and the ExprEval engine just get a detoast version
>>>   so it doesn't need to detoast it again.
>>
>> I don't think this is true. If it is true, it needs to be extremely
>> well-justified. Even if this seems to work in simple cases, I suspect
>> there will be cases where it breaks badly, resulting in memory leaks
>> or server crashes or some other kind of horrible problem that I can't
>> quite imagine. Unfortunately, my knowledge of this code isn't
>> fantastic, so I can't say exactly what bad thing will happen, and I
>> can't even say with 100% certainty that anything bad will happen, but
>> I bet it will. It seems like it goes against everything I understand
>> about how TupleTableSlots are supposed to be used. (Andres would be
>> the best person to give a definitive answer here.)

I think having a discussion of the design of TupleTableSlots would be
helpful for the progress since in my knowldege it doesn't against
anything of the TupleTaleSlots :(, as I stated above. I'm sure I'm
pretty open on this.

>>
>>> - The keypoint is the memory management and effeiciency. for now I think
>>>   all the places where "slot->tts_nvalid" is set to 0 means the
>>>   tts_values[*] is no longer validate, then this is the place we should
>>>   release the memory for the "detoast datum".  All the other places like
>>>   ExecMaterializeSlot or ExecCopySlot also need to think about the
>>>   "detoast datum" as well.
>>
>> This might be a way of addressing some of the issues with this idea,
>> but I doubt it will be acceptable. I don't think we want to complicate
>> the slot API for this feature. There's too much downside to doing
>> that, in terms of performance and understandability.

Complexity: I double check the code, it has very limitted changes on the
TupleTupleSlot APIs (less than 10 rows).  see tuptable.h.

Performance: by design, it just move the chance of detoast action
*conditionly* and put it back to tts_values[*], there is no extra
overhead to find out the detoast version for sharing.

Understandability: based on how do we understand TupleTableSlot:)

[1] https://www.postgresql.org/message-id/87il4jrk1l.fsf%40163.com

--
Best Regards
Andy Fan