Re: Import Statistics in postgres_fdw before resorting to sampling. - Mailing list pgsql-hackers

From Corey Huinker
Subject Re: Import Statistics in postgres_fdw before resorting to sampling.
Date
Msg-id CADkLM=d1HH1JnhCmpD+3v+A=JZ4EeQRa86Ygfaucb1zKpS1vWw@mail.gmail.com
In response to Re: Import Statistics in postgres_fdw before resorting to sampling.  (Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>)
List pgsql-hackers
> Instead of fetch_stats, I would prefer has_stats. But if the majority
> is for fetch_stats, I am ok with it.

I don't think there are any strong feelings about the names of the parameters, only their meanings and their defaults.
 
> I have no strong opinion about whether fetch_stats should be on or off
> by default. In my simple view, the remote relation that the foreign
> table points to is usually a regular table or a partitioned table. So
> defaulting to on makes sense to me. But there's a possibility that in
> an OLAP system most of the foreign objects are views exposing only
> some of the data, not all of it. I think defaulting to off is backward
> compatible and requires a cautious opt-in.

If the remote object is a Postgres view, we're never going to get stats, so a default of fetch_stats = on, coupled with automatic failover to regular sampling, would result in some trivial network traffic and some NOTICE/WARNING log messages until the owner sets fetch_stats = off on the local table.
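For example, assuming the proposed option ends up as an ordinary per-table postgres_fdw option (following the syntax of existing options like use_remote_estimate; the table name here is made up), that would look something like:

    ALTER FOREIGN TABLE remote_view_ft
        OPTIONS (ADD fetch_stats 'false');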
 
> Or else suddenly every ANALYZE on the foreign server becomes costlier,
> since it runs a couple of queries before falling back to sampling.

In all scenarios, remote_analyze defaults to false, and once the remote ANALYZE has run, the remote database should continue to have stats going forward. I can see value in a more insistent NOTICE/WARNING for situations where a remote ANALYZE fails, or where the table still has inadequate stats after the remote ANALYZE.
 
> Why do we need remote_analyze as a flag? I might have missed some
> conversation which led to including this flag. We fetch relkind, so we
> know whether the object on the foreign server supports statistics or
> not.

Well, sorta. We know the database on the other end talks libpq and has the table pg_class. There are more than a few forks (Redshift and Vertica come to mind) where this check will succeed but the design diverges after that, with no pg_statistic or anything similar. We have to do deeper checks of the remote system (matching pg_statistic rows to all the va_cols we expect, etc.) to verify that we _probably_ have good stats, but we don't truly know whether the stats will be there until we fetch and import them, and at that point we're done.

That is part of my discomfort with StatisticsAreImportable: it can identify some of the cases where we _can't_ import stats, but to identify whether we _can_, we would just be moving all of the functionality of ImportStatistics into StatisticsAreImportable.
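For illustration, the "deeper check" could amount to a remote catalog query along these lines (my sketch, not the patch's actual query, and the table name is invented; a real check would also have to cope with pg_statistic being readable only by superusers, extended statistics, and so on):

    -- Count analyzable columns of the target relation that have no
    -- pg_statistic row; a nonzero result means the stats are missing
    -- or incomplete.
    SELECT count(*) AS columns_missing_stats
    FROM pg_attribute a
    LEFT JOIN pg_statistic s
           ON s.starelid = a.attrelid
          AND s.staattnum = a.attnum
          AND NOT s.stainherit
    WHERE a.attrelid = 'public.some_table'::regclass
      AND a.attnum > 0
      AND NOT a.attisdropped
      AND s.starelid IS NULL;

A fork that merely has pg_class will answer the relkind probe happily but fall over here, which is exactly the divergence described above.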
 
> If it supports statistics, we run ANALYZE automatically. I mean
> the user wants us to ANALYZE the foreign table and has also indicated
> that the remote object is capable of storing stats. We should document
> the fact that we analyze if we do not find existing statistics. Why
> would users want to disable running ANALYZE? One possibility is that
> they don't want to disturb query plans with suddenly changing
> statistics, but then they would also need to switch auto-analyze off.
> So maybe there is a way to know when it's not acceptable to run
> ANALYZE on the remote server. I feel this flag will put another burden
> on the user, who already has to set so many flags in postgres_fdw.
> That is especially true when the flag is useful only the first time,
> when the remote object doesn't yet have stats; usually there will be
> stats. If we have to introduce remote_analyze at all, let it default
> to the value of fetch_stats.

I may have answered this in the responses above, but remote_analyze does not default to ON, nor do I envision making that the default. I think the primary purpose of the flag is situations where the remote database is periodically "swapped out" for one with a fresher set of the data (via DNS swapping, port swapping, etc.) and the new database was put online before vacuum got around to those tables. I agree that running the remote ANALYZE is pulling a big lever, hence we only try it if we failed to find sufficient existing stats.
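In that swapped-database scenario the opt-in would look something like this (hypothetical table name, and assuming remote_analyze lands as an ordinary per-table postgres_fdw option):

    -- Opt this one table into the big-lever behavior; everything
    -- else keeps the default of remote_analyze = false.
    ALTER FOREIGN TABLE swapped_ft
        OPTIONS (ADD remote_analyze 'true');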
 
> If both importing statistics and the fallback analysis fail, I think
> we should resort to sampling, since the user has asked for an ANALYZE
> of the table and sampling will achieve that. As you said, giving a
> warning at every failure will nudge the user to set their options
> correctly.

Sorry if my description was confusing; I should have been clearer that "fallback analysis" is the regular row sampling. In scenario #4 (the one I advocate and Nathan advocate[d?]), the decision tree is this (a concrete session sketch follows the list):

1. if fetch_stats = off OR StatisticsAreImportable = false THEN resort_to_regular_row_sampling and exit normally.
2. attempt_to_fetch_remote_stats. If we fetched and imported remote stats, THEN exit normally.
3. if remote_analyze = false THEN resort_to_regular_row_sampling and exit normally.
4. send_remote_analyze_command; if that fails THEN WARN and resort_to_regular_row_sampling and exit normally.
5. (step #2, again, essentially) attempt_to_fetch_remote_stats. If we fetched and imported remote stats, THEN exit normally.
6. WARN user of potentially intractable problem, resort_to_regular_row_sampling and exit normally.
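
To make the flow concrete, here is a hypothetical session against a server with both options enabled (object names invented; the comments map to the steps above):

    ALTER SERVER remote_srv
        OPTIONS (ADD fetch_stats 'true', ADD remote_analyze 'true');

    ANALYZE orders_ft;
    -- step 1: fetch_stats is on and StatisticsAreImportable passes
    -- step 2: remote stats found -> imported, exit normally; otherwise:
    -- step 3: remote_analyze is on, so we keep going
    -- step 4: ANALYZE is sent to the remote server (WARN + sample if it fails)
    -- step 5: the fetch is retried and, on success, we exit normally
    -- step 6: otherwise WARN and fall back to regular row sampling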
