Re: Should we optimize the `ORDER BY random() LIMIT x` case? - Mailing list pgsql-hackers

From Aleksander Alekseev
Subject Re: Should we optimize the `ORDER BY random() LIMIT x` case?
Date
Msg-id CAJ7c6TPrYx82FXPmms7xmHpsfv=P=fReky9aEg5rFYLnt2NXbQ@mail.gmail.com
In response to Should we optimize the `ORDER BY random() LIMIT x` case?  (Aleksander Alekseev <aleksander@timescale.com>)
Responses Re: Should we optimize the `ORDER BY random() LIMIT x` case?
List pgsql-hackers
wenhui, Andrei,

> if we can optimize the query, that would be great

Thanks for the feedback!

> I think I got your point, but just to be sure:
> Do you want to have some random sampling from an arbitrary subquery with
> the guarantee that N distinct (by tid) tuples will be produced or all
> the tuples if the underlying subquery produces less than N?

Right.

> What kind of optimisation trick may the optimiser use here to provide an
> optimal plan? As I see it, it will need to think that all the tuples
> should be returned from the subquery. The only profit is to skip sorting
> the massive sample.

It doesn't look like a generic optimization trick will help us here. I was
thinking about a custom aggregate function, e.g. `SELECT sample(*, 10)
...`. However, I doubt that aggregate functions are flexible enough.
Alternatively, a rewrite rule might work, but I have never dealt with
those before, so I have no idea what I'm talking about :D

> As a palliative, you may laterally join your subquery with a stored
> procedure, which will process the incoming tuple and implement the logic
> of random sampling.

Yes, that's a good observation. The query:

```
SELECT t.* FROM mytable AS t
INNER JOIN LATERAL (
    SELECT 1 WHERE random() > 0.99
) AS r ON TRUE
```

samples 1% of the tuples efficiently. I wonder if there is also an
efficient way to sample "no more than N" tuples without knowing the
size of the entire set / stream beforehand. Implementing such an
algorithm in C is not difficult [1]; doing it in SQL doesn't seem so.

> Implementation of that in the core will need a new "skip result" node
> and new syntax, which may be too much if a workaround is found.

Agreed. I'm not too worried about the new node. The new syntax
apparently is going to be mandatory; otherwise we would have to check
"wait, what if it's ORDER BY random() LIMIT?" for every single query. I
wonder what the community's opinion is on having such syntax.

[1] e.g. https://samwho.dev/reservoir-sampling/
-- 
Best regards,
Aleksander Alekseev


