TABLESAMPLE patch is really in pretty sad shape - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | TABLESAMPLE patch is really in pretty sad shape |
Date | |
Msg-id | 12048.1436646520@sss.pgh.pa.us Whole thread Raw |
Responses |
Re: TABLESAMPLE patch is really in pretty sad shape
Re: TABLESAMPLE patch is really in pretty sad shape |
List | pgsql-hackers |
The two contrib modules this patch added are nowhere near fit for public consumption. They cannot clean up after themselves when dropped: regression=# create extension tsm_system_rows; CREATE EXTENSION regression=# create table big as select i, random() as x from generate_series(1,1000000) i; SELECT 1000000 regression=# create view v1 as select * from big tablesample system_rows(10); CREATE VIEW regression=# drop extension tsm_system_rows; DROP EXTENSION The view is still there, but is badly broken: regression=# select * from v1; ERROR: cache lookup failed for function 46379 Potentially this is a security issue, since a malicious user could probably manage to create a Trojan horse function having the now-vacated OID, whereupon use of the view would invoke that function. Worse still, the broken pg_tablesample_method row is still there: regression=# create extension tsm_system_rows; ERROR: duplicate key value violates unique constraint "pg_tablesample_method_name_index" DETAIL: Key (tsmname)=(system_rows) already exists. And even if we fixed that, these modules will not survive a pg_upgrade cycle, because pg_upgrade has no idea that it would need to create a pg_tablesample_method row (remember that we do *not* replay the extension script during binary upgrade). Raw inserts into system catalogs just aren't a sane thing to do in extensions. Some of the risks here come from what seems like a fundamentally wrong decision to copy all of the info about a tablesample method out of the pg_tablesample_method catalog *at parse time* and store it permanently in the query parse tree. This makes any sort of "alter tablesample method" DDL operation impossible in principle, because any views/rules referencing the method have already copied the data. On top of that, I find the basic implementation design rather dubious, because it supposes that the tablesample filtering step must always come first; the moment you say TABLESAMPLE you can kiss goodbye the idea that the query will use any indexes. For example: d2=# create table big as select i, random() as x from generate_series(1,10000000) i; SELECT 10000000 d2=# create index on big(i); CREATE INDEX d2=# analyze big; ANALYZE d2=# explain analyze select * from big where i between 100 and 200; QUERYPLAN --------------------------------------------------------------------------------------------------------------------Index Scanusing big_i_idx on big (cost=0.43..10.18 rows=87 width=12) (actual time=0.022..0.088 rows=101 loops=1) Index Cond:((i >= 100) AND (i <= 200))Planning time: 0.495 msExecution time: 0.141 ms (4 rows) d2=# explain analyze select * from big tablesample bernoulli(10) where i between 100 and 200; QUERY PLAN --------------------------------------------------------------------------------------------------------------------Sample Scan(bernoulli) on big (cost=0.00..54055.13 rows=9 width=12) (actual time=0.028..970.051 rows=13 loops=1) Filter: ((i >=100) AND (i <= 200)) Rows Removed by Filter: 999066Planning time: 0.182 msExecution time: 970.093 ms (5 rows) Now, maybe I don't understand the use-case for this feature, but I should think it's meant for dealing with tables that are so big that you can't afford to scan all the data. So, OK, a samplescan is hopefully cheaper than a pure seqscan, but that doesn't mean that fetching 999079 rows and discarding 999066 of them is a good plan design. There needs to be an operational mode whereby we can use an index and do random sampling of the TIDs the index returns. I do not insist that that has to appear in version one of the feature --- but I am troubled by the fact that, by exposing an oversimplified API for use by external modules, this patch is doubling down on the assumption that no such operational mode will ever need to be implemented. There are a whole lot of lesser sins, such as documentation that was clearly never copy-edited by a native speaker of English, badly designed planner APIs (Paths really ought not have rowcounts different from the underlying RelOptInfo), potential core dumps due to dereferencing values that could be null (the guards for null values are in the wrong places entirely), etc etc. While there's nothing here that couldn't be fixed by nuking the contrib modules and putting a week or two of concentrated work into fixing the core code, I for one certainly don't have time to put that kind of effort into TABLESAMPLE right now. Nor do I really have the interest; I find this feature of pretty dubious value. What are we going to do about this? regards, tom lane
pgsql-hackers by date: