On Fri, 1 Aug 2025 00:28:30 +0900
Yugo Nagata <nagata@sraoss.co.jp> wrote:
> Hi,
>
> On Tue, 24 Jun 2025 17:05:33 +0900
> Yugo Nagata <nagata@sraoss.co.jp> wrote:
>
> > Instead, I'm thinking of an alternative approach: expanding the expression
> > at the time statistics are collected.
>
> I've attached a new patch in this approache.
>
> This allows to collect statistics on virtual generated columns.
>
> During ANALYZE, generation expressions are expanded, and statistics are computed
> using compute_expr_stats(). To support this, both compute_expr_stats() and AnlExprData
> are now exported from extended_stats.c. However, since they are no longer specific
> to extended statistics, it might be better to move them to analyze.c and vacuum.h.
>
> To enable the optimizer to make use of these statistics, a new field named
> virtual_gencols is added to RelOptInfo. This field holds the expressions of
> virtual generated columns in the table. In examine_variable(), if an expression
> in a WHERE clause matches a virtual generated column, the corresponding statistics
> are used for that expression.
>
> Example:
>
> - Before applying the patch, the cardinality estimate is erroneous.
>
> test=# create table t (i int, j int generated always as (i*10) virtual);
> CREATE TABLE
> test=# insert into t select generate_series(1,1000);
> INSERT 0 1000
> test=# insert into t select 1 from generate_series(1,1000);
> INSERT 0 1000
> test=# analyze t;
> ANALYZE
> test=# explain analyze select * from t where j = 10;
> QUERY PLAN
> -------------------------------------------------------------------------------------------------
> Seq Scan on t (cost=0.00..0.02 rows=1 width=8) (actual time=0.031..0.806 rows=1001.00 loops=1)
> Filter: ((i * 10) = 10)
> Rows Removed by Filter: 999
> Buffers: shared hit=9
> Planning:
> Buffers: shared hit=10
> Planning Time: 0.299 ms
> Execution Time: 0.948 ms
> (8 rows)
>
>
> - After applying the patch, the cardinality estimate is correct.
>
> test=# analyze t;
> ANALYZE
> test=# explain analyze select * from t where j = 10;
> QUERY PLAN
> -----------------------------------------------------------------------------------------------------
> Seq Scan on t (cost=0.00..41.50 rows=1001 width=8) (actual time=0.034..0.871 rows=1001.00 loops=1)
> Filter: ((i * 10) = 10)
> Rows Removed by Filter: 999
> Buffers: shared hit=9
> Planning:
> Buffers: shared hit=6
> Planning Time: 0.374 ms
> Execution Time: 1.028 ms
> (8 rows)
>
>
> Note that the patch is still a work in progress, so documentation and tests are not included.
I've attached an updated patch.
I modified the documentation to remove the statement that virtual generated columns
do not have statistics.
In addition, I added a test to ensure that statistics on virtual generated columns
are available.
Regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>