Home > mailing lists

Re: Columns correlation and adaptive query optimization - Mailing list pgsql-hackers

From	Konstantin Knizhnik
Subject	Re: Columns correlation and adaptive query optimization
Date	January 25, 2021 13:27:25
Msg-id	318656e2-4d5d-2865-f1ba-a5c705327ac8@postgrespro.ru Whole thread Raw
In response to	Re: Columns correlation and adaptive query optimization (Yugo NAGATA <nagata@sraoss.co.jp>)
Responses	Re: Columns correlation and adaptive query optimization
List	pgsql-hackers

Tree view

Hello,

Thank you for review.
My answers are inside.

On 21.01.2021 15:30, Yugo NAGATA wrote:

Hello,

On Thu, 26 Mar 2020 18:49:51 +0300
Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:

Attached please find new version of the patch with more comments and 
descriptions added.

Adaptive query optimization is very interesting feature for me, so I looked
into this patch.  Here are some comments and questions.

(1)
This patch needs rebase because clauselist_selectivity was modified to improve
estimation of OR clauses.

Rebased version is attached.


(2)
If I understand correctly, your proposal consists of the following two features.

1. Add a feature to auto_explain that creates an extended statistic automatically
if an error on estimated rows number is large.

2. Improve rows number estimation of join results by considering functional
dependencies between vars in the join qual if the qual has more than one clauses,
and also functional dependencies between a var in the join qual and vars in quals
of the inner/outer relation.

As you said, these two parts are independent each other, so one feature will work
even if we don't assume the other.  I wonder it would be better to split the patch
again, and register them to commitfest separately.

I agree with you that this are two almost unrelated changes, although without clausesel patch additional statistic can not improve query planning.
But I already have too many patches at commitfest.
May be it will be enough to spit this patch into two?


(3)
+	DefineCustomBoolVariable("auto_explain.suggest_only",
+							 "Do not create statistic but just record in WAL suggested create statistics statement.",
+							 NULL,
+							 &auto_explain_suggest_on

To imply that this parameter is involving to add_statistics_threshold, it seems
better for me to use more related name like add_statistics_suggest_only.

Also, additional documentations for new parameters are required.

Done.


(4)
+			/*
+			 * Prevent concurrent access to extended statistic table
+			 */
+			stat_rel = table_open(StatisticExtRelationId, AccessExclusiveLock);
+			slot = table_slot_create(stat_rel, NULL);
+			scan = table_beginscan_catalog(stat_rel, 2, entry);
(snip)
+			table_close(stat_rel, AccessExclusiveLock);
+		}

When I tested the auto_explain part, I got the following WARNING.
 WARNING:  buffer refcount leak: [097] (rel=base/12879/3381, blockNum=0, flags=0x83000000, refcount=1 2) WARNING:  buffer refcount leak: [097] (rel=base/12879/3381, blockNum=0, flags=0x83000000, refcount=1 1) WARNING:  relcache reference leak: relation "pg_statistic_ext" not closed WARNING:  TupleDesc reference leak: TupleDesc 0x7fa439266338 (12029,-1) still referenced WARNING:  Snapshot reference leak: Snapshot 0x55c332c10418 still referenced

To suppress this, I think we need table_endscan(scan) and
ExecDropSingleTupleTableSlot(slot) before finishing this function.

Thank you for noticing the problem, fixed.


(6)
+					elog(NOTICE, "Auto_explain suggestion: CREATE STATISTICS %s %s FROM %s", stat_name, create_stat_stmt, rel_name);

We should use ereport instead of elog for log messages.

Changed.


(7)
+						double dep = find_var_dependency(root, innerRelid, var, clauses_attnums);
+						if (dep != 0.0)
+						{
+							s1 *= dep + (1 - dep) * s2;
+							continue;
+						}

I found the following comment of clauselist_apply_dependencies():
  * we actually combine selectivities using the formula  *  *      P(a,b) = f * Min(P(a), P(b)) + (1-f) * P(a) * P(b)

so, is it not necessary using the same formula in this patch? That is, 
   s1 *= dep + (1-dep) * s2                   (if s1 <= s2)  s1 *= dep * (s2/s1) + (1-dep) * s2   (otherwise) .

Makes sense.

(8)
+/*
+ * Try to find dependency between variables.
+ * var: varaibles which dependencies are considered
+ * join_vars: list of variables used in other clauses
+ * This functions return strongest dependency and some subset of variables from the same relation
+ * or 0.0 if no dependency was found.
+ */
+static double
+var_depends_on(PlannerInfo *root, Var* var, List* clause_vars)
+{

The comment mentions join_vars but the actual argument name is clauses_vars,
so it needs unification.

Fixed.


(9)
Currently, it only consider functional dependencies statistics. Can we also
consider multivariate MCV list, and is it useful?

Right now auto_explain create statistic without explicit specification of statistic kind.
According to the documentation all supported statistics kinds should be created in this case:

statistics_kind

A statistics kind to be computed in this statistics object. Currently supported kinds are ndistinct, which enables n-distinct statistics, and dependencies, which enables functional dependency statistics. If this clause is omitted, all supported statistics kinds are included in the statistics object. For more information, see Section 14.2.2 and Section 68.2.


(10)
To achieve adaptive query optimization  (AQO) in PostgreSQL, this patch proposes
to use auto_explain for getting feedback from actual results. So, could auto_explain
be a infrastructure of AQO in future? Or, do you have any plan or idea to make
built-in infrastructure for AQO?

Sorry, I do not have answer for this question.
I just patched auto_explain extension because it is doing half of the required work (analyze expensive statements).
It can be certainly moved to separate extension. In this case it will party duplicate existed functionality and
settings of auto_explain (like statement execution time threshold). I am not sure that it is good.
But from the other side, this my patch makes auto_explain extension to do some unexpected work...

Actually task of adaptive query optimization is much bigger.
We have separate AQO extension which tries to use machine learning to correctly adjust estimations.
This my patch is much simpler and use existed mechanism (extended statistics) to improve estimations.

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment

pgsql-hackers by date:

From: Greg Nancarrow
Date: 25 January 2021, 13:14:14
Subject: Re: Parallel INSERT (INTO ... SELECT ...)

From: Guillaume Lelarge
Date: 25 January 2021, 13:34:02
Subject: Re: Extensions not dumped when --schema is used

Re: Columns correlation and adaptive query optimization - Mailing list pgsql-hackers

Attachment

Previous

Next