wenhui, Andrei,
> if we can optimize the query, that would be great
Thanks for the feedback!
> I think I got your point, but just to be sure:
> Do you want to have some random sampling from an arbitrary subquery with
> the guarantee that N distinct (by tid) tuples will be produced or all
> the tuples if the underlying subquery produces less than N?
Right.
> What kind of optimisation trick may the optimiser use here to provide an
> optimal plan? As I see it, it will need to think that all the tuples
> should be returned from the subquery. The only profit is to skip sorting
> the massive sample.
It doesn't look like a generic optimization trick will help us here. I
was thinking about a custom aggregate function, e.g. `SELECT sample(*,
10) ...`. However, I doubt that aggregate functions are flexible
enough. Alternatively, a rewrite rule might work, but I've never dealt
with those before, so I have no idea what I'm talking about :D
> As a palliative, you may laterally join your subquery with a stored
> procedure, which will process the incoming tuple and implement the logic
> of random sampling.
Yes, that's a good observation. The query:
```
SELECT t.* FROM mytable AS t
INNER JOIN LATERAL (
    SELECT 1 WHERE random() > 0.99
) AS r ON TRUE
```
samples 1% of the tuples efficiently. I wonder if there is also an
efficient way to sample "no more than N" tuples without knowing the
size of the entire set / stream beforehand. Implementing such an
algorithm in C is not difficult [1]; doing it in SQL doesn't seem as
easy.
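For illustration, here is a minimal sketch of the algorithm I have in
mind (classic reservoir sampling, as described in [1]); the plain int
array and the function name are just stand-ins for a real tuple
stream:

```
#include <stdlib.h>

/*
 * Classic reservoir sampling: keep a uniform random sample of up to
 * n items from a stream of unknown length in a single pass, using
 * O(n) memory.  Returns the number of items actually placed in
 * sample[], which is less than n when the stream itself is shorter --
 * matching the "all the tuples if the subquery produces less than N"
 * semantics above.
 */
static size_t
reservoir_sample(const int *stream, size_t len, int *sample, size_t n)
{
    size_t i;

    for (i = 0; i < len; i++)
    {
        if (i < n)
        {
            /* Fill the reservoir with the first n items. */
            sample[i] = stream[i];
        }
        else
        {
            /* Replace a random slot with probability n / (i + 1). */
            size_t j = (size_t) rand() % (i + 1);

            if (j < n)
                sample[j] = stream[i];
        }
    }

    return len < n ? len : n;
}
```

The hard part is not the algorithm itself but expressing the same
single-pass, bounded-memory behavior in plain SQL.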
> Implementation of that in the core will need a new "skip result" node
> and new syntax, which may be too much if a workaround is found.
Agreed. I'm not too worried about the new node. The new syntax,
however, is apparently going to be mandatory: otherwise we would have
to check "wait, what if it's ORDER BY random() LIMIT?" for every
single query. I wonder what the community's opinion is on having such
a syntax.
[1] e.g. https://samwho.dev/reservoir-sampling/
--
Best regards,
Aleksander Alekseev