Thread: Re: [PATCHES] Eliminate more detoast copies for packed varlenas
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > Gregory Stark <stark@enterprisedb.com> writes: >> Ok, this removes what should be most if not all of the call sites where we're >> detoasting text or byteas. In particular it gets all the regexp/like functions >> and all the trim/pad functions. It also gets hashtext and hash_any. > > Applied with some fixes --- you'd missed like_match.c, which doubtless > explains Guillame's complaint that it didn't work ... Strange. It passed all regression tests for me and it seems like this is something that would have been caught even in single-byte mode by a simple test. It seems to me that like_match.c only used for SIMILAR is that right? That would explain it as there don't appear to be any tests of SIMILAR. Separately: > Although I'd misdiagnosed the reason for the recent failures on buildfarm > member grebe, I see no reason to revert the 1-byte-header-friendly changes I > made in varlena.c. Instead, tweak the code a little bit to get more > advantage out of that. This may have been a misdiagnosis of the buildfarm failures but it looks like a correct bug fix to me. That is, it looks like regexp_split_to_array() was capable of passing a packed varlena to setup_regexp_matches which wasn't expecting it. But this I don't understand why it wouldn't cause regression failures and indeed when I tested it with my build it didn't cause any problems. This all brings up the question of what other files should be considered for fixing. There is the possibility that some of the other sites could turn out to be performance critical for a given application. In particular I'm primarily concerned with calls which could be invoked during a data load or index build. Of the following list it seems to me the calls in adt/acl.c are probably interesting to fix. Especially since it seems we could just replace all those text* parameters with Datum parameters since they're only going to be passed to textin anyways. After that the tsearch and xml data types are probably interesting, and the various data type input functions and contrib gist/gin operator classes. I'm fine doing the drudge work but wanted to check before I do it that I'm not doing something we don't think is worth doing at this point in time. src/backend/access/transam/xlog.c:3 src/backend/catalog/pg_conversion.c:2 src/backend/commands/sequence.c:1 src/backend/libpq/be-fsstubs.c:2 src/backend/tsearch/dict.c:3 src/backend/tsearch/to_tsany.c:6 src/backend/tsearch/wparser.c:6 src/backend/utils/adt/acl.c:61 src/backend/utils/adt/char.c:1 src/backend/utils/adt/date.c:3 src/backend/utils/adt/dbsize.c:2 src/backend/utils/adt/encode.c:1 src/backend/utils/adt/formatting.c:14 src/backend/utils/adt/genfile.c:3 src/backend/utils/adt/not_in.c:2 src/backend/utils/adt/quote.c:2 src/backend/utils/adt/regproc.c:1 src/backend/utils/adt/ruleutils.c:6 src/backend/utils/adt/tid.c:1 src/backend/utils/adt/timestamp.c:8 src/backend/utils/adt/tsquery_rewrite.c:1 src/backend/utils/adt/tsvector_op.c:3 src/backend/utils/adt/xml.c:24 src/backend/utils/mb/mbutils.c:1 src/tutorial/funcs_new.c:3 contrib/adminpack/adminpack.c:6 contrib/chkpass/chkpass.c:2 contrib/dblink/dblink.c:41 contrib/fuzzystrmatch/dmetaphone.c:2 contrib/fuzzystrmatch/fuzzystrmatch.c:6 contrib/hstore/hstore_gin.c:1 contrib/hstore/hstore_gist.c:1 contrib/hstore/hstore_op.c:6 contrib/intarray/_int_op.c:1 contrib/ltree/ltree_op.c:3 contrib/pageinspect/btreefuncs.c:3 contrib/pageinspect/rawpage.c:1 contrib/pg_trgm/trgm_gin.c:2 contrib/pg_trgm/trgm_gist.c:1 contrib/pg_trgm/trgm_op.c:3 contrib/pgcrypto/pgcrypto.c:10 contrib/pgcrypto/pgp-pgsql.c:1 contrib/pgrowlocks/pgrowlocks.c:1 contrib/pgstattuple/pgstatindex.c:2 contrib/pgstattuple/pgstattuple.c:1 contrib/sslinfo/sslinfo.c:2 contrib/tablefunc/tablefunc.c:14 contrib/tsearch2/dict.c:3 contrib/tsearch2/dict_ex.c:1 contrib/tsearch2/dict_ispell.c:1 contrib/tsearch2/dict_snowball.c:3 contrib/tsearch2/dict_syn.c:1 contrib/tsearch2/dict_thesaurus.c:1 contrib/tsearch2/query.c:4 contrib/tsearch2/query_rewrite.c:1 contrib/tsearch2/ts_cfg.c:1 contrib/tsearch2/ts_stat.c:2 contrib/tsearch2/tsvector.c:2 contrib/tsearch2/wparser.c:9 contrib/uuid-ossp/uuid-ossp.c:2 contrib/xml2/xpath.c:25 contrib/xml2/xslt_proc.c:3 -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Gregory Stark wrote: > "Tom Lane" <tgl@sss.pgh.pa.us> writes: > > >> Gregory Stark <stark@enterprisedb.com> writes: >> >>> Ok, this removes what should be most if not all of the call sites where we're >>> detoasting text or byteas. In particular it gets all the regexp/like functions >>> and all the trim/pad functions. It also gets hashtext and hash_any. >>> >> Applied with some fixes --- you'd missed like_match.c, which doubtless >> explains Guillame's complaint that it didn't work ... >> > > Strange. It passed all regression tests for me and it seems like this is > something that would have been caught even in single-byte mode by a simple > test. It seems to me that like_match.c only used for SIMILAR is that right? > That would explain it as there don't appear to be any tests of SIMILAR. > > No. like_match.c contains the template for all the various incarnations of LIKE and ILIKE functions. It is included multiple times with different sets of #defines in like.c to create those functions (currently there are 4 of them). It also supplies the template for the like_escape functions, and this is where the macros are used. Those functions are apparently only called if there is an explicit ESCAPE clause. Some of our regression tests do have this, so I'm not sure what happened. cheers andrew
Gregory Stark <stark@enterprisedb.com> writes: > This may have been a misdiagnosis of the buildfarm failures but it looks like > a correct bug fix to me. That is, it looks like regexp_split_to_array() was > capable of passing a packed varlena to setup_regexp_matches which wasn't > expecting it. But this I don't understand why it wouldn't cause regression > failures and indeed when I tested it with my build it didn't cause any > problems. The particular regression tests we have for those functions seem to pass constants to them, not table columns, and so they don't see packed inputs. (It might be interesting to make textin produce a packed result when possible, just to see what breaks; but I would be afraid to try to do that for production...) > This all brings up the question of what other files should be considered for > fixing. I'm very much against such a wholesale edit as you seem to have in mind here. We already had some destabilization from the limited patch that went in; now when we're trying to get to beta is not the time for more. Maybe at the beginning of 8.4 devel cycle would be a reasonable time to consider touching a lot of files. regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > Gregory Stark <stark@enterprisedb.com> writes: > > (It might be interesting to make textin produce a packed result when > possible, just to see what breaks; but I would be afraid to try to do > that for production...) > >> This all brings up the question of what other files should be considered for >> fixing. > > I'm very much against such a wholesale edit as you seem to have in mind > here. We already had some destabilization from the limited patch that > went in; now when we're trying to get to beta is not the time for more. > Maybe at the beginning of 8.4 devel cycle would be a reasonable time > to consider touching a lot of files. Well I did expect that sort of concern if I went ahead and did all of them, or nearly all of them. That's why I'm asking if any of the list seem like they might be important enough to do now. For 8.4 I'm starting to think it would make sense to make the distinction between a "real" varlena and a possibly-unaligned pointer so text* wouldn't be an ambiguous type which might not be what it appears to be. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > (It might be interesting to make textin produce a packed result when > possible, just to see what breaks; but I would be afraid to try to do > that for production...) Reassuringly all checks pass with a hack like that in place. (attached) I think the right approach here is to define a new type text_packed * (which would just be a char* or varattrib_1b* or something like that). Then PG_GETARG_*_PP would return one of these new pointers. This does leave us with warnings whenever an old-style function calls a new-style function but I think there would be relatively few of those since we'll tackle the higher traffic areas first which will be the lower level functions. The benefit is that it will give us a warning if we try to pass a pointer from a new-style function to an old-style function which isn't prepared to receive it. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Attachment
"Gregory Stark" <stark@enterprisedb.com> writes: > "Tom Lane" <tgl@sss.pgh.pa.us> writes: > >> (It might be interesting to make textin produce a packed result when >> possible, just to see what breaks; but I would be afraid to try to do >> that for production...) > > Reassuringly all checks pass with a hack like that in place. (attached) For the record I've been doing some more testing and found one place that could be a problem down the road. I'm not sure why it didn't show up previously. In selfuncs.c we use VARDATA/VARSIZE on data that is taken from parser Const nodes and from the histogram arrays without detoasting them. Currently this is safe as array elements are not packed and parser nodes contain values read using textin and never stored in a tuple. But down the road I expect we'll want to pack array element so this code would need to detoast the elements or prepare to handle packed elements. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Gregory Stark <stark@enterprisedb.com> writes: > For the record I've been doing some more testing and found one place that > could be a problem down the road. I'm not sure why it didn't show up > previously. In selfuncs.c we use VARDATA/VARSIZE on data that is taken from > parser Const nodes and from the histogram arrays without detoasting them. > Currently this is safe as array elements are not packed and parser nodes > contain values read using textin and never stored in a tuple. But down the > road I expect we'll want to pack array element so this code would need to > detoast the elements or prepare to handle packed elements. Hmmm ... I think this should be fixed now, actually. I'm far from convinced that a Const could never contain a toasted datum. Consider constant-folding in the planner --- it just stuffs the result of a function into a Const node. In fact, it seems there's a different risk here: if such a datum were toasted out-of-line, the reference in a cached plan might live longer than the underlying toast-table data. Maybe we need a forcible detoast in evaluate_function(). regards, tom lane
I wrote: > In fact, it seems there's a different risk here: if such a datum were > toasted out-of-line, the reference in a cached plan might live longer > than the underlying toast-table data. Maybe we need a forcible detoast > in evaluate_function(). Sure enough, it seems we do. The attached script fails in every release back to 7.3. It's a rather contrived case, because a function marked immutable probably shouldn't be reading from a table at all, especially not one you are likely to change or drop. That's probably why we've not heard reports of this before. But it's definitely a bug. regards, tom lane create table big(f1 text); insert into big values(repeat('xyzzy',100000)); create or replace function getbig() returns text as 'select f1 from big' language sql immutable; create or replace function usebig(text) returns bool as ' begin return $1 ~ ''xyzzy''; end ' language plpgsql; prepare foo as select usebig(getbig()) from int4_tbl; execute foo; drop table big; execute foo;
This has been saved for the 8.4 release: http://momjian.postgresql.org/cgi-bin/pgpatches_hold --------------------------------------------------------------------------- Gregory Stark wrote: > "Tom Lane" <tgl@sss.pgh.pa.us> writes: > > > Gregory Stark <stark@enterprisedb.com> writes: > >> Ok, this removes what should be most if not all of the call sites where we're > >> detoasting text or byteas. In particular it gets all the regexp/like functions > >> and all the trim/pad functions. It also gets hashtext and hash_any. > > > > Applied with some fixes --- you'd missed like_match.c, which doubtless > > explains Guillame's complaint that it didn't work ... > > Strange. It passed all regression tests for me and it seems like this is > something that would have been caught even in single-byte mode by a simple > test. It seems to me that like_match.c only used for SIMILAR is that right? > That would explain it as there don't appear to be any tests of SIMILAR. > > Separately: > > > Although I'd misdiagnosed the reason for the recent failures on buildfarm > > member grebe, I see no reason to revert the 1-byte-header-friendly changes I > > made in varlena.c. Instead, tweak the code a little bit to get more > > advantage out of that. > > This may have been a misdiagnosis of the buildfarm failures but it looks like > a correct bug fix to me. That is, it looks like regexp_split_to_array() was > capable of passing a packed varlena to setup_regexp_matches which wasn't > expecting it. But this I don't understand why it wouldn't cause regression > failures and indeed when I tested it with my build it didn't cause any > problems. > > > This all brings up the question of what other files should be considered for > fixing. There is the possibility that some of the other sites could turn out > to be performance critical for a given application. In particular I'm > primarily concerned with calls which could be invoked during a data load or > index build. > > Of the following list it seems to me the calls in adt/acl.c are probably > interesting to fix. Especially since it seems we could just replace all those > text* parameters with Datum parameters since they're only going to be passed > to textin anyways. > > After that the tsearch and xml data types are probably interesting, and the > various data type input functions and contrib gist/gin operator classes. > > I'm fine doing the drudge work but wanted to check before I do it that I'm not > doing something we don't think is worth doing at this point in time. > > src/backend/access/transam/xlog.c:3 > src/backend/catalog/pg_conversion.c:2 > src/backend/commands/sequence.c:1 > src/backend/libpq/be-fsstubs.c:2 > src/backend/tsearch/dict.c:3 > src/backend/tsearch/to_tsany.c:6 > src/backend/tsearch/wparser.c:6 > src/backend/utils/adt/acl.c:61 > src/backend/utils/adt/char.c:1 > src/backend/utils/adt/date.c:3 > src/backend/utils/adt/dbsize.c:2 > src/backend/utils/adt/encode.c:1 > src/backend/utils/adt/formatting.c:14 > src/backend/utils/adt/genfile.c:3 > src/backend/utils/adt/not_in.c:2 > src/backend/utils/adt/quote.c:2 > src/backend/utils/adt/regproc.c:1 > src/backend/utils/adt/ruleutils.c:6 > src/backend/utils/adt/tid.c:1 > src/backend/utils/adt/timestamp.c:8 > src/backend/utils/adt/tsquery_rewrite.c:1 > src/backend/utils/adt/tsvector_op.c:3 > src/backend/utils/adt/xml.c:24 > src/backend/utils/mb/mbutils.c:1 > src/tutorial/funcs_new.c:3 > > contrib/adminpack/adminpack.c:6 > contrib/chkpass/chkpass.c:2 > contrib/dblink/dblink.c:41 > contrib/fuzzystrmatch/dmetaphone.c:2 > contrib/fuzzystrmatch/fuzzystrmatch.c:6 > contrib/hstore/hstore_gin.c:1 > contrib/hstore/hstore_gist.c:1 > contrib/hstore/hstore_op.c:6 > contrib/intarray/_int_op.c:1 > contrib/ltree/ltree_op.c:3 > contrib/pageinspect/btreefuncs.c:3 > contrib/pageinspect/rawpage.c:1 > contrib/pg_trgm/trgm_gin.c:2 > contrib/pg_trgm/trgm_gist.c:1 > contrib/pg_trgm/trgm_op.c:3 > contrib/pgcrypto/pgcrypto.c:10 > contrib/pgcrypto/pgp-pgsql.c:1 > contrib/pgrowlocks/pgrowlocks.c:1 > contrib/pgstattuple/pgstatindex.c:2 > contrib/pgstattuple/pgstattuple.c:1 > contrib/sslinfo/sslinfo.c:2 > contrib/tablefunc/tablefunc.c:14 > contrib/tsearch2/dict.c:3 > contrib/tsearch2/dict_ex.c:1 > contrib/tsearch2/dict_ispell.c:1 > contrib/tsearch2/dict_snowball.c:3 > contrib/tsearch2/dict_syn.c:1 > contrib/tsearch2/dict_thesaurus.c:1 > contrib/tsearch2/query.c:4 > contrib/tsearch2/query_rewrite.c:1 > contrib/tsearch2/ts_cfg.c:1 > contrib/tsearch2/ts_stat.c:2 > contrib/tsearch2/tsvector.c:2 > contrib/tsearch2/wparser.c:9 > contrib/uuid-ossp/uuid-ossp.c:2 > contrib/xml2/xpath.c:25 > contrib/xml2/xslt_proc.c:3 > > > -- > Gregory Stark > EnterpriseDB http://www.enterprisedb.com > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Have you searched our list archives? > > http://archives.postgresql.org -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://postgres.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Added to TODO: * Research reducing deTOASTing in more places http://archives.postgresql.org/pgsql-hackers/2007-09/msg00895.php --------------------------------------------------------------------------- Gregory Stark wrote: > "Tom Lane" <tgl@sss.pgh.pa.us> writes: > > > Gregory Stark <stark@enterprisedb.com> writes: > >> Ok, this removes what should be most if not all of the call sites where we're > >> detoasting text or byteas. In particular it gets all the regexp/like functions > >> and all the trim/pad functions. It also gets hashtext and hash_any. > > > > Applied with some fixes --- you'd missed like_match.c, which doubtless > > explains Guillame's complaint that it didn't work ... > > Strange. It passed all regression tests for me and it seems like this is > something that would have been caught even in single-byte mode by a simple > test. It seems to me that like_match.c only used for SIMILAR is that right? > That would explain it as there don't appear to be any tests of SIMILAR. > > Separately: > > > Although I'd misdiagnosed the reason for the recent failures on buildfarm > > member grebe, I see no reason to revert the 1-byte-header-friendly changes I > > made in varlena.c. Instead, tweak the code a little bit to get more > > advantage out of that. > > This may have been a misdiagnosis of the buildfarm failures but it looks like > a correct bug fix to me. That is, it looks like regexp_split_to_array() was > capable of passing a packed varlena to setup_regexp_matches which wasn't > expecting it. But this I don't understand why it wouldn't cause regression > failures and indeed when I tested it with my build it didn't cause any > problems. > > > This all brings up the question of what other files should be considered for > fixing. There is the possibility that some of the other sites could turn out > to be performance critical for a given application. In particular I'm > primarily concerned with calls which could be invoked during a data load or > index build. > > Of the following list it seems to me the calls in adt/acl.c are probably > interesting to fix. Especially since it seems we could just replace all those > text* parameters with Datum parameters since they're only going to be passed > to textin anyways. > > After that the tsearch and xml data types are probably interesting, and the > various data type input functions and contrib gist/gin operator classes. > > I'm fine doing the drudge work but wanted to check before I do it that I'm not > doing something we don't think is worth doing at this point in time. > > src/backend/access/transam/xlog.c:3 > src/backend/catalog/pg_conversion.c:2 > src/backend/commands/sequence.c:1 > src/backend/libpq/be-fsstubs.c:2 > src/backend/tsearch/dict.c:3 > src/backend/tsearch/to_tsany.c:6 > src/backend/tsearch/wparser.c:6 > src/backend/utils/adt/acl.c:61 > src/backend/utils/adt/char.c:1 > src/backend/utils/adt/date.c:3 > src/backend/utils/adt/dbsize.c:2 > src/backend/utils/adt/encode.c:1 > src/backend/utils/adt/formatting.c:14 > src/backend/utils/adt/genfile.c:3 > src/backend/utils/adt/not_in.c:2 > src/backend/utils/adt/quote.c:2 > src/backend/utils/adt/regproc.c:1 > src/backend/utils/adt/ruleutils.c:6 > src/backend/utils/adt/tid.c:1 > src/backend/utils/adt/timestamp.c:8 > src/backend/utils/adt/tsquery_rewrite.c:1 > src/backend/utils/adt/tsvector_op.c:3 > src/backend/utils/adt/xml.c:24 > src/backend/utils/mb/mbutils.c:1 > src/tutorial/funcs_new.c:3 > > contrib/adminpack/adminpack.c:6 > contrib/chkpass/chkpass.c:2 > contrib/dblink/dblink.c:41 > contrib/fuzzystrmatch/dmetaphone.c:2 > contrib/fuzzystrmatch/fuzzystrmatch.c:6 > contrib/hstore/hstore_gin.c:1 > contrib/hstore/hstore_gist.c:1 > contrib/hstore/hstore_op.c:6 > contrib/intarray/_int_op.c:1 > contrib/ltree/ltree_op.c:3 > contrib/pageinspect/btreefuncs.c:3 > contrib/pageinspect/rawpage.c:1 > contrib/pg_trgm/trgm_gin.c:2 > contrib/pg_trgm/trgm_gist.c:1 > contrib/pg_trgm/trgm_op.c:3 > contrib/pgcrypto/pgcrypto.c:10 > contrib/pgcrypto/pgp-pgsql.c:1 > contrib/pgrowlocks/pgrowlocks.c:1 > contrib/pgstattuple/pgstatindex.c:2 > contrib/pgstattuple/pgstattuple.c:1 > contrib/sslinfo/sslinfo.c:2 > contrib/tablefunc/tablefunc.c:14 > contrib/tsearch2/dict.c:3 > contrib/tsearch2/dict_ex.c:1 > contrib/tsearch2/dict_ispell.c:1 > contrib/tsearch2/dict_snowball.c:3 > contrib/tsearch2/dict_syn.c:1 > contrib/tsearch2/dict_thesaurus.c:1 > contrib/tsearch2/query.c:4 > contrib/tsearch2/query_rewrite.c:1 > contrib/tsearch2/ts_cfg.c:1 > contrib/tsearch2/ts_stat.c:2 > contrib/tsearch2/tsvector.c:2 > contrib/tsearch2/wparser.c:9 > contrib/uuid-ossp/uuid-ossp.c:2 > contrib/xml2/xpath.c:25 > contrib/xml2/xslt_proc.c:3 > > > -- > Gregory Stark > EnterpriseDB http://www.enterprisedb.com > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Have you searched our list archives? > > http://archives.postgresql.org -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://postgres.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +