Re: Showing applied extended statistics in explain Part 2 - Mailing list pgsql-hackers

From Andrei Lepikhov
Subject Re: Showing applied extended statistics in explain Part 2
Date
Msg-id 80e1a20a-19e1-427a-87d7-a39d012ff843@gmail.com
Whole thread Raw
In response to Re: Showing applied extended statistics in explain Part 2  (Tomas Vondra <tomas@vondra.me>)
List pgsql-hackers
On 2/10/25 16:56, Tomas Vondra wrote:
> On 2/10/25 10:09, Andrei Lepikhov wrote:
>> On 8/2/2025 20:50, Tomas Vondra wrote:
>> <StatisticsInfo LastUpdate="2024-09-09T21:55:04.43"
>> ModificationCount="0" SamplingPercent="17.9892"
>> Statistics="[_WA_Sys_00000001_0A55DF1D]" Table="[_InfoRgX]"
>> Schema="[dbo]" Database="[DB]"></StatisticsInfo>
>>
>> The main profit here - you see all the stats involved in estimations
>> (and their state), even if final plan doesn't contain estimated stuff at
>> all.
> 
> OK, that seems very underwhelming. I still think we should show which
> clauses were estimated using which statistics object.
To understand how it may work, I employed the EXPLAIN extensibility 
introduced in PG 18 to show the use of plain statistics [1]. It looks 
like the following:

EXPLAIN (COSTS OFF, STAT ON)
SELECT * FROM sc_a WHERE x=1 AND y LIKE 'a';

  Seq Scan on sc_a
    Filter: ((y ~~ 'a'::text) AND (x = 1))
  Statistics:
    "sc_a.y: 1 times, stats: { MCV: 10 values, Correlation,
        ndistinct: 10.0000, nullfrac: 0.0000, width: 5 }
    "sc_a.x: 1 times, stats: { Histogram: 0 values, Correlation,
         ndistinct: -1.0000, nullfrac: 0.0000, width: 4 }

As you can see, stat usage is summarised at the end of the EXPLAIN. It 
contains information about the column, how many times it was used and 
the parameters of statistic slots.
Of course, being an extension it is constrained a lot, but even there is 
the profit:
1. You may see types of statistics exist on the column
2. Precision of the histogram or MCV (statistic_target) on a specific 
table - some users forget to increase it on large (or partitioned) tables
3. You have basic stat like nullfrac, ndistinct without the necessity to 
teach personnel how to gather it on a production instance safely.

Also, using it in real cases, I realised that it would be highly 
profitable to specify which statistic type was used to estimate this 
specific clause.

Of course, extended statistics have their own specifics, which may 
require another output format. Just consider this example too.

[1] https://github.com/danolivo/pg_index_stats

-- 
regards, Andrei Lepikhov



pgsql-hackers by date:

Previous
From: Japin Li
Date:
Subject: Re: Disallow redundant indexes
Next
From: "Vitaly Davydov"
Date:
Subject: Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly