Re: Should we optimize the `ORDER BY random() LIMIT x` case? - Mailing list pgsql-hackers

From Aleksander Alekseev
Subject Re: Should we optimize the `ORDER BY random() LIMIT x` case?
Msg-id CAJ7c6TMaNgQeYVS0G7fPCQLOSs4xCiHuZ+sGXbRJihTzCcvvaw@mail.gmail.com
In response to Should we optimize the `ORDER BY random() LIMIT x` case?  (Aleksander Alekseev <aleksander@timescale.com>)
Tom, Nico, Vik,

> Seems to me the obvious answer is to extend TABLESAMPLE (or at least, some
> of the tablesample methods) to allow it to work on a subquery.

Currently our BERNOULLI and SYSTEM sampling methods don't allow
specifying the maximum number of rows that should be returned, only a
percentage. So extending them to subqueries would effectively give us
the same thing as:

```
SELECT t.* FROM mytable AS t
INNER JOIN LATERAL (
    SELECT 1 WHERE random() < 0.99
) AS r ON TRUE
```
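
For reference, this is what the current interface looks like: the
argument is a percentage of the table, and the exact number of rows
returned varies from run to run (`mytable` here is a hypothetical table):

```
-- Returns roughly 1% of mytable; there is no way to ask
-- BERNOULLI or SYSTEM for "at most N rows".
SELECT * FROM mytable TABLESAMPLE BERNOULLI (1);
```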

Just to clarify, do you propose to add a RESERVOIR sampling method as well?

> I assume it gathers the whole result and _then_ orders by random(), yes?

Sure.
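
This is easy to confirm with EXPLAIN (`mytable` again being a
hypothetical table): the plan is a Limit over a Sort, and the Sort node
has to consume its entire input, evaluating random() once per row,
before the Limit can return anything.

```
-- Expect a Limit node over a Sort node over a scan: the sort must
-- read every input row before the first row can be returned.
EXPLAIN SELECT * FROM mytable ORDER BY random() LIMIT 10;
```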

> I think the obvious thing to do is to have the optimizer recognize
> `ORDER BY random() LIMIT x` and do the reservoir sampling thing as the
> underlying query produces rows.  The engine needs to keep track of how
> many rows it's seen so far so that with each new row it can use
> `random() < 1/n` to decide whether to keep the row.

I agree this would be most convenient for the user. Unfortunately,
this would require checking every SELECT query: "oh, isn't it by any
chance ORDER BY random() LIMIT x?". I don't think we can afford such
a performance degradation, even a small one, for an arguably rare
case.
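
To illustrate what is being proposed, here is a minimal sketch of
reservoir sampling (Algorithm R) written as a PL/pgSQL function. The
table `mytable` and the function name are made up for illustration; a
real implementation would of course live in the planner/executor, not
in SQL:

```
-- A sketch only: reservoir sampling (Algorithm R) over a
-- hypothetical table "mytable". The first k rows fill the
-- reservoir; the n-th row then replaces a random slot with
-- probability k/n.
CREATE FUNCTION reservoir_sample(k int) RETURNS SETOF mytable
LANGUAGE plpgsql AS $$
DECLARE
    reservoir mytable[] := '{}';
    rec mytable;
    n bigint := 0;
    j int;
BEGIN
    FOR rec IN SELECT * FROM mytable LOOP
        n := n + 1;
        IF n <= k THEN
            reservoir := reservoir || rec;      -- fill the reservoir
        ELSE
            j := 1 + floor(random() * n)::int;  -- uniform in [1, n]
            IF j <= k THEN
                reservoir[j] := rec;            -- replace with prob. k/n
            END IF;
        END IF;
    END LOOP;
    RETURN QUERY SELECT * FROM unnest(reservoir);
END;
$$;
```

Usage would be simply `SELECT * FROM reservoir_sample(10);` -- a
single scan of the input, no full sort.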

> > temporary tables using `CREATE TEMP TABLE ... AS SELECT * FROM tbl
> > TABLESAMPLE BERNOULLI(20)` but this is inconvenient and would be
> > suboptimal even if we supported global temporary tables.
>
> Speaking of which, what would it take to have global temp tables?

Considering that several people (myself included [1]) have tried to
address this issue in several ways over at least the past decade,
without success, I would say it is a fairly difficult problem. I
remember seeing a global temp tables patch on the commitfests maybe a
year or two ago, but it's not there anymore. Check the past commitfests
to find out how it ended up.

> TABLESAMPLE is hitched to a <table primary> which can be basically
> anything resembling a relation.  So it appears the standard already
> allows this and we just need to implement it.

Vik, many thanks for sharing this. I don't have a strong opinion on
`FETCH SAMPLE FIRST 10 ROWS ONLY`, but since TABLESAMPLE should, or at
least may, work for subqueries anyway, we could focus on that for now.
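
If that reading of the standard is right, the syntax we would need to
accept looks roughly like this (`big_table` and its columns are
placeholders; PostgreSQL currently raises a syntax error here, since
TABLESAMPLE is only accepted after a plain table name):

```
-- Hypothetical: TABLESAMPLE applied to a derived table.
SELECT *
FROM (SELECT a, b FROM big_table WHERE b > 0) AS sub
     TABLESAMPLE BERNOULLI (1);
```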

***

As a separate observation, I noticed that we could do something like this:

```
-- imagine replacing inefficient array_sample(array_agg(t), 10)
-- with more efficient array_sample_reservoir(t, 10)
SELECT (unnest(agg)).* FROM (
    SELECT array_sample(array_agg(t), 10) AS agg FROM (
        ... here goes the subquery ...
    ) AS t
) AS s;
```

... if only we supported such column expansion for unregistered
record types. Currently, such a query fails with:

```
ERROR:  record type has not been registered
```

The result contains all the needed records; the query merely fails to
display them as separate columns. Note that in this case I don't care
whether the column names end up being something like '???' or
'unnamed1'. I can imagine this being convenient in more cases than
just this one.
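
For what it's worth, the error can be worked around today by casting
the rows to a declared composite type, so that the row type is
registered before aggregation. `sample_row` and the generate_series
subquery below are stand-ins for the real row type and subquery
(array_sample() requires PostgreSQL 16):

```
-- Workaround: register the row type explicitly.
CREATE TYPE sample_row AS (a int, b text);

SELECT (unnest(agg)).* FROM (
    SELECT array_sample(array_agg(ROW(g, g::text)::sample_row), 10) AS agg
    FROM generate_series(1, 1000) AS g
) AS s;
```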

Any thoughts on whether we should support this?

[1]: https://www.postgresql.org/message-id/flat/20160729141552.4062e19b%40fujitsu
-- 
Best regards,
Aleksander Alekseev


