Thread: `order by random()` makes select-list `random()` invocations deterministic
In Postgres 16.1, running the following query: ``` select gs, gs + random() * 100 - 50 as gs2, random() * 100 - 50 as r1, random() * 100 - 50 as r2, random() * 100 - 50 as r3 from generate_series(0, 10) as gs order by random(); ``` Every `random()` invocation in the select list uses a single consistent value within each returned row. Remove the `order by random()` and values become randomized as expected, but it gets a bit stranger: - order by any of the `rN` values and `gs2` uses a different random value, but all the `rN` values are identical; - order by `gs2` and all random values are distinct again.
Re: `order by random()` makes select-list `random()` invocations deterministic
From
"David G. Johnston"
Date:
On Wednesday, February 28, 2024, Dian Fay <di@nmfay.com> wrote:
Every `random()` invocation in the select list uses a single consistent
value within each returned row. Remove the `order by random()` and
values become randomized as expected, but it gets a bit stranger:
This seems to be one of those “won’t fix” bugs that stems from the parser being a bit too liberal in what it accepts as valid SQL.
The fact that the gs involving expression is seen differently than the ones not involving gs doesn’t surprise me.
David J.
Re: `order by random()` makes select-list `random()` invocations deterministic
From
"Dian Fay"
Date:
On Thu Feb 29, 2024 at 12:53 AM EST, David G. Johnston wrote: > On Wednesday, February 28, 2024, Dian Fay <di@nmfay.com> wrote: > > > > > > Every `random()` invocation in the select list uses a single consistent > > value within each returned row. Remove the `order by random()` and > > values become randomized as expected, but it gets a bit stranger: > > > > > This seems to be one of those “won’t fix” bugs that stems from the parser > being a bit too liberal in what it accepts as valid SQL. > > > https://www.postgresql.org/message-id/CAKFQuwZ3-XGfcS%2BCLTAYvPx3ARYjUxv%2B%3DYL8sOicV0nda%3DT5cA%40mail.gmail.com > > The fact that the gs involving expression is seen differently than the ones > not involving gs doesn’t surprise me. > > David J. Thanks, that makes sense! Any thoughts on whether it's worth a cautionary note in the `order by` and/or random function docs since `order by random()` is a fairly well attested solution and I'm not the first person to run into this quirk?
"Dian Fay" <di@nmfay.com> writes: > Thanks, that makes sense! Any thoughts on whether it's worth a > cautionary note in the `order by` and/or random function docs since > `order by random()` is a fairly well attested solution and I'm not the > first person to run into this quirk? The question is where to put the caution, because it's not like this is somehow specific to random(). I believe what's fundamentally happening here is that ordinarily, functions in the targetlist are evaluated during the final projection step that occurs after the scan/join phase. So for example: regression=# explain (verbose, costs off) select f1, random(), random() from int4_tbl; QUERY PLAN ---------------------------------- Seq Scan on public.int4_tbl Output: f1, random(), random() (2 rows) regression=# select f1, random(), random() from int4_tbl; f1 | random | random -------------+---------------------+--------------------- 0 | 0.6623584085865575 | 0.8611211203466376 123456 | 0.7371018974566144 | 0.707800598912321 -123456 | 0.5652778572736816 | 0.2684184354600243 2147483647 | 0.47747487098101504 | 0.531163579706837 -2147483647 | 0.3829707208069777 | 0.11094005213737002 (5 rows) The two textually distinct occurrences of random() are evaluated separately. However, if you use that same function as a sort key: regression=# explain (verbose, costs off) select f1, random(), random() from int4_tbl order by random(); QUERY PLAN ----------------------------------------- Result Output: f1, (random()), (random()) -> Sort Output: f1, (random()) Sort Key: (random()) -> Seq Scan on public.int4_tbl Output: f1, random() (7 rows) regression=# select f1, random(), random() from int4_tbl order by random(); f1 | random | random -------------+---------------------+--------------------- -123456 | 0.39761234612097884 | 0.39761234612097884 123456 | 0.7227920193261217 | 0.7227920193261217 0 | 0.7628733460552672 | 0.7628733460552672 2147483647 | 0.8745622687164281 | 0.8745622687164281 -2147483647 | 0.929020084273001 | 0.929020084273001 (5 rows) Reading that explain plan requires a certain amount of expertise, but the key point is that the extra parens around the upper-level "random()" expressions indicate that the value is being copied up from the subplan rather than being evaluated afresh. So the need to include "random()" in what passes through the sort step results in the final projection just re-using that value in both places, because the targetlist entries (and subexpressions thereof) are matched literally against what is coming out of the sort step without consideration for whether those expressions are volatile and what should happen if they are. The same would happen with a GROUP BY expression. So that's why it happens like that from an implementation standpoint, but that doesn't inform us much about what we might like to do differently. I think the current behavior arose from wanting to support SQL92 syntax like select random() from mytab order by 1; Here it should surprise nobody if the output appears in sorted order --- in fact, I bet people would say it's a bug if it doesn't. However, SQL99 got rid of that syntax and would have us write select random() from mytab order by random(); Here it's very much less obvious whether the two occurrences of random() are meant to denote the same value. PG has historically taken the position that they are, mainly because the SQL92 behavior clearly has some use and you can't readily get that with the SQL99 syntax unless you read it this way. However, having taken that position, it's hard to argue that select random(), random() from mytab order by random(); shouldn't result in all three instances meaning the same value. There's an argument certainly that we should require you to write it differently if you want that behavior, probably by having just one random() instance in a sub-select. But I doubt we'd get a lot of kudos for changing a behavior that's stood for a couple of decades. Maybe we could write something like "An expression or subexpression in the SELECT list that matches an ORDER BY or GROUP BY item is taken to represent the same value that was sorted or grouped by, even when the (sub)expression is volatile". I'm not sure where to put this though. (TBH, I'm also not very sure that we honor that in absolutely every case ...) regards, tom lane