Re: tuple radix sort - Mailing list pgsql-hackers

From John Naylor
Subject Re: tuple radix sort
Date
Msg-id CANWCAZaKRzsYPn9YWRR8g5dkX-ep_jgyUb4r3KFekK5ZTNo7ew@mail.gmail.com
Whole thread Raw
In response to Re: tuple radix sort  (Chao Li <li.evan.chao@gmail.com>)
List pgsql-hackers
On Thu, Nov 27, 2025 at 2:49 PM Chao Li <li.evan.chao@gmail.com> wrote:
> I did an initial test before, but I didn’t read the code at the time. Today, I spent time reviewing 0001. Overall, I
believethe implementation is solid. I just got a few comments/suggestions. 

Thanks for looking!

> > On Nov 26, 2025, at 21:11, John Naylor <johncnaylorls@gmail.com> wrote:

> >
<v5-0001-Use-radix-sort-when-SortTuple-contains-a-pass-by-.patch><v5-0004-Detect-common-prefix-to-avoid-wasted-work-during-.patch><v5-0003-WIP-make-some-regression-tests-sort-order-more-de.patch><v5-0002-WIP-Adjust-regression-tests.patch>

> We recompute normalize_datum(tup->datum1, ssup) for every tuple in every level, why don’t cache the result in
SortTuple.As we have cached current_byte in SortTuple, it shouldn’t be a big deal to add one more field to it. 

Actually it is a big deal, because the memtuples array counts against work_mem.

I saw two ways to store the full normalized without increasing the
size, and I've already rejected them:
- share space with isnull1/srctape -- v3 did this, and I already
explained the reason for changing when I shared v4.
- share space with datum1 -- that would require additional code to
restore the original datum and makes it more difficult to verify
correctness

There is also a proposal upthread to not store anything, and that's
still up in the air.

> 2 - 0001
> ```
> +       while (num_remaining > 1)
> +       {
> +               /* start the count over */
> +               num_remaining = num_partitions;
> ```
>
> The swap loop always start the count over, so that sorted partitions are re-scanned as well. I think we can do an
optimizationlike: 
>
> ```
> num_active = num_partitions;
> while (num_active > 1)
> {
>     for (int i = 0; i < num_active; )
>     {
>         uint8 idx = remaining_partitions[i];
>         // do the swaps for the partition …
>
>         if (partitions[idx].offset == partitions[idx].next_offset)
>         {
>             remaining_partitions[i] = remaining_partitions[num_active - 1];
>             num_active--;
>         }
>         else
>             i++;
>     }
> }
> ```
>
> This way we move out sorted partitions, so they will not be re-scanned.

I don't think that's going to work without additional bookkeeping, so
I'm not sure it's worth it. See the recursion step.

> 3 - 0001
>
> In sort_byvalue_datum, we can add a fast-path for all NULL and all NOT NULL cases, so that they won’t need to run the
branchlesscyclic permutation and the two “for” loops of assertions. Something like: 

This is too clever and yet doesn't go far enough.

There is already one fast path, which happens to cover the common ASC
+ all NOT NULL case. The right way to skip the permutation step would
be to add a second loop that starts from the end and stops at the
first tuple that needs to swap. That should work not just for all NULL
and all NOT NULL, but also where there is a mix of the two and some
(or all) are already in place. All these cases can be treated the same
and they will continue to the exact same paths.

I haven't yet bothered to try harder, but it may be necessary in order
to avoid introducing regressions, so I'll look into it.

--
John Naylor
Amazon Web Services



pgsql-hackers by date:

Previous
From: Soumya S Murali
Date:
Subject: Re: [PATCH] Expose checkpoint timestamp and duration in pg_stat_checkpointer
Next
From: Nazir Bilal Yavuz
Date:
Subject: Re: Remove unused function parameters, part 1: contrib