Re: Should we optimize the `ORDER BY random() LIMIT x` case? - Mailing list pgsql-hackers
From | Aleksander Alekseev
---|---
Subject | Re: Should we optimize the `ORDER BY random() LIMIT x` case?
Msg-id | CAJ7c6TMaNgQeYVS0G7fPCQLOSs4xCiHuZ+sGXbRJihTzCcvvaw@mail.gmail.com
In response to | Should we optimize the `ORDER BY random() LIMIT x` case? (Aleksander Alekseev <aleksander@timescale.com>)
Responses | Re: Should we optimize the `ORDER BY random() LIMIT x` case? (×3)
List | pgsql-hackers
Tom, Nico, Vik,

> Seems to me the obvious answer is to extend TABLESAMPLE (or at least, some
> of the tablesample methods) to allow it to work on a subquery.

Currently our BERNOULLI and SYSTEM sampling methods don't allow specifying
the maximum number of rows that should be returned, only the percentage. So
it's going to do the same as:

```
SELECT t.* FROM mytable AS t
INNER JOIN LATERAL (
    SELECT 1 WHERE random() < 0.99
) AS r ON TRUE
```

Just to clarify, do you propose to add a RESERVOIR sampling method as well?

> I assume it gathers the whole result and _then_ orders by random(), yes?

Sure.

> I think the obvious thing to do is to have the optimizer recognize
> `ORDER BY random() LIMIT x` and do the reservoir sampling thing as the
> underlying query produces rows. The engine needs to keep track of how
> many rows it's seen so far so that with each new row it can use
> `random() < 1/n` to decide whether to keep the row.

I agree this would be most convenient for the user. Unfortunately, it would
require us to check every SELECT query: "oh, isn't it by any chance ORDER BY
random() LIMIT x?". I don't think we can afford such a performance
degradation, even a small one, for an arguably rare case. (A sketch of the
reservoir logic itself follows at the end of this message.)

> > temporary tables using `CREATE TEMP TABLE ... AS SELECT * FROM tbl
> > TABLESAMPLE BERNOULLI(20)` but this is inconvenient and would be
> > suboptimal even if we supported global temporary tables.
>
> Speaking of which, what would it take to have global temp tables?

Considering the fact that several people (me included [1]) have tried to
address this issue in several ways for at least the past decade without
success, I would say this is a fairly difficult problem. I remember seeing a
global temp tables patch on the commitfests maybe a year or two ago, but
it's not there anymore. One would have to check the past commitfests to find
out how it ended up.

> TABLESAMPLE is hitched to a <table primary> which can be basically
> anything resembling a relation. So it appears the standard already
> allows this and we just need to implement it.

Vik, many thanks for sharing this. I don't have a strong opinion on
`FETCH SAMPLE FIRST 10 ROWS ONLY`, but since TABLESAMPLE should, or at
least may, work for subqueries anyway, we could focus on that for now
(the hypothetical subquery syntax is sketched below as well).

***

As a separate observation, I noticed that we could do something like this:

```
-- imagine replacing inefficient array_sample(array_agg(t), 10)
-- with more efficient array_sample_reservoir(t, 10)
SELECT (unnest(agg)).* AS k FROM (
    SELECT array_sample(array_agg(t), 10) AS agg FROM (
        ... here goes the subquery ...
    ) AS t
);
```

... if only we supported such a column expansion for unregistered record
types. Currently such a query fails with:

```
ERROR:  record type has not been registered
```

It has all the needed records; it just fails to display them as separate
columns. Note that in this case I don't care if the column names end up
being something like '???' or 'unnamed1'. I can imagine this being
convenient in more cases than just this one. Any thoughts on whether we
should support it? (A variant that already works for registered row types
is sketched below.)

[1]: https://www.postgresql.org/message-id/flat/20160729141552.4062e19b%40fujitsu

--
Best regards,
Aleksander Alekseev
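For concreteness, here is what the reservoir logic quoted above could look
like. This is a minimal sketch of Algorithm R as a PL/pgSQL function; the
table `mytable` and the function name are assumptions for illustration.
PL/pgSQL cannot hold arrays of anonymous records, which is one reason a
real implementation would have to live in the executor and work
generically:

```
-- Minimal sketch of reservoir sampling (Algorithm R) in PL/pgSQL.
-- "mytable" and the function name are hypothetical; a builtin version
-- would do the same bookkeeping while the underlying plan produces rows.
CREATE OR REPLACE FUNCTION reservoir_sample_mytable(k int)
RETURNS SETOF mytable AS $$
DECLARE
    rec mytable;
    reservoir mytable[] := '{}';
    n bigint := 0;  -- rows seen so far
    j int;
BEGIN
    FOR rec IN SELECT * FROM mytable LOOP
        n := n + 1;
        IF n <= k THEN
            -- fill the reservoir with the first k rows
            reservoir := array_append(reservoir, rec);
        ELSE
            -- keep row n with probability k/n, evicting a random entry
            j := 1 + floor(random() * n)::int;
            IF j <= k THEN
                reservoir[j] := rec;
            END IF;
        END IF;
    END LOOP;
    FOREACH rec IN ARRAY reservoir LOOP
        RETURN NEXT rec;
    END LOOP;
END;
$$ LANGUAGE plpgsql VOLATILE;
```

`SELECT * FROM reservoir_sample_mytable(10);` then returns 10 uniformly
chosen rows in a single pass, never holding more than k rows in memory,
which is what an `ORDER BY random() LIMIT x` rewrite would buy.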
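As for extending TABLESAMPLE to subqueries, something along these lines is
presumably what the standard's <table primary> wording would permit. The
syntax below is hypothetical and is not accepted by PostgreSQL today; the
RESERVOIR method, taking a row count rather than a percentage, is the
hypothetical addition asked about above:

```
-- Hypothetical syntax, not accepted by PostgreSQL today:
SELECT *
FROM (SELECT * FROM mytable WHERE ...) AS sub
     TABLESAMPLE BERNOULLI (20);

-- With a hypothetical RESERVOIR method capped by row count
-- rather than by percentage:
SELECT *
FROM (SELECT * FROM mytable WHERE ...) AS sub
     TABLESAMPLE RESERVOIR (10);
```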
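Finally, regarding the record expansion problem: the same pattern already
works when the aggregated rows carry a registered composite type, i.e. when
they come from a real table rather than a bare subquery. A sketch, assuming
a table `mytable` and PostgreSQL 16 or later for `array_sample()`:

```
-- Works today: "t" carries the registered row type of "mytable",
-- so the composite column can be expanded after unnest().
SELECT (x).* FROM (
    SELECT unnest(agg) AS x FROM (
        SELECT array_sample(array_agg(t), 10) AS agg
        FROM mytable AS t
    ) AS s
) AS u;
```

It is only when `t` comes from a subquery, and therefore has an anonymous
record type, that the expansion fails with the error quoted above.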