Re: WIP: Collecting statistics on CSV file data - Mailing list pgsql-hackers
From | Shigeru HANADA |
---|---|
Subject | Re: WIP: Collecting statistics on CSV file data |
Date | |
Msg-id | 4F7D8BBF.4050004@gmail.com Whole thread Raw |
In response to | Re: WIP: Collecting statistics on CSV file data (Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp>) |
Responses |
Re: WIP: Collecting statistics on CSV file data
|
List | pgsql-hackers |
(2011/12/13 22:00), Etsuro Fujita wrote: > Thank you for your effectiveness experiments and proposals for > improvements. I updated the patch according to your proposals. > Attached is the updated version of the patch. Hi all, I've revised the v6.1 patch and created v7 patch, though dead line of this CF is coming closer. I think that this feature provides a way to improve plans for foreign tables significantly, so I hope that this feature is available in 9.2. I'd like to show overview of the patch again for ease of review. New FDW API function ==================== This patch adds new FDW API function AnalyzeForeignTable to FdwRoutine which can be used to support updating local statistics of foreign table. This function is invoked when ANALYZE command is executed against a foreign table explicitly. This handler function is optional, so if underlying FDW set NULL for the pointer, PostgreSQL doesn't touch statistics but emits a message about skipping. It's not required to FDWs to implement fully featured analyzer by itself. They can use core routines, do_analyze_rel and others, for most difficult part of analyzing. What FDWs should do is to provide a sampling function and call do_analyze_rel with passing the sampling function as argument in their concrete AnalyzeForeignTable. AnalyzeForeignTable (or sampling function) can report FDW-specific additional information by calling ereport() with given elevel. Once we've considered an idea that FDW stores statistics information without calling do_analyze_rel, but it seems very hard to implement, and not so efficient. I tried to implement such handler in pgsql_fdw (which seeems easiest to achieve) by getting remote statistics by "SELECT * FROM pg_statistics", but it has several issues such as: 1) Highly privileged user on remote side should be mapped to ANALYZE invoker 2) Structure and semantics might be difficult on remote side, if versions are not same. 3) We need to convert anyarray to anyarray through text representation. We know type of elements, but bothering works are needed. DDL changes =========== ALTER FOREIGN TABLE supports SET STATISTICS clause and n_distinct option. Former changes per-attribute statistics target, and latter overrides calculated statistics. n_distinct_inherited is not available because foreign tables can't be inherited. psql support ============ psql completes foreign table names after the keyword "ANALYZE" in addition to ordinary tables. Of course newly added statistics target is shown in \d+ command. file_fdw ======== This patch contains a use case of new handler function in contrib/file_fdw. Since file_fdw reads data from a flat file, fileAnalyzeForeignTable uses similar algorithm to ordinary tables; it samples first N rows first, and replaces them randomly with subsequent rows. Also file_fdw updates pg_class.relpages by calculating number of pages from size of the data file. To allow FDWs to implement sampling argorighm like this, several functions are exported from analyze.c, e.g. random_fract, init_selection_state, and get_next_S. pgsql_fdw ========= Though it's not fully finished, I've implemented ANALYZE handler for pgsql_fdw. Please extract pgsql_fdw.tar.gz into contrib (or use pgxs) to try it. Regards, -- Shigeru HANADA
Attachment
pgsql-hackers by date: