Thread: External search engine, advice
I have an external search engine system which plugs in to postgres. I use a few C functions to interface the search daemon with the Postgres back-end. The best that I have been able to do is do a "select" for each result. I have a live demo/test site: http://www.mohawksoft.com/search.php3, and the PHP source code is at http://www.mohawksoft.com/ftss_example.txt. I would love to get the results with one select statement, but have, to date, been unable to figure out how. Anyone with any ideas?
> I have an external search engine system which plugs in to postgres. I use a few > C functions to interface the search daemon with the Postgres back-end. > > The best that I have been able to do is do a "select" for each result. I have a > live demo/test site: > > http://www.mohawksoft.com/search.php3, and the PHP source code is at > http://www.mohawksoft.com/ftss_example.txt. > > I would love to get the results with one select statement, but have, to date, > been unable to figure out how. Anyone with any ideas? It's possible to return a set of results from C functions using the new function manager in 7.1 or later. Take a look at following email in the archive. Subject: Re: [INTERFACES] Re: can external C-function get multiple rows? From: Tom Lane <tgl@sss.pgh.pa.us> To: alexey@price.ru cc: pgsql-interfaces@postgresql.org Date: Mon, 30 Apr 2001 01:52:57 -0400 Actually I have created such a function calling an external full text search engine called "namazu". Here is an example to search a keyword "int8" from index files pre-generated by namazu. test=# select pgnmzsrch('int8','/home/t-ishii/lib/namazu/hackers'); ?column? ----------------------------------------/home/t-ishii/lib/namazu/hackers/21000/home/t-ishii/lib/namazu/hackers/21001/home/t-ishii/lib/namazu/hackers/21003/home/t-ishii/lib/namazu/hackers/21004/home/t-ishii/lib/namazu/hackers/21002/home/t-ishii/lib/namazu/hackers/21005/home/t-ishii/lib/namazu/hackers/21006 (7 rows) -- Tatsuo Ishii
Tatsuo Ishii wrote: > > > I have an external search engine system which plugs in to postgres. I use a few > > C functions to interface the search daemon with the Postgres back-end. > > > > The best that I have been able to do is do a "select" for each result. I have a > > live demo/test site: > > > > http://www.mohawksoft.com/search.php3, and the PHP source code is at > > http://www.mohawksoft.com/ftss_example.txt. > > > > I would love to get the results with one select statement, but have, to date, > > been unable to figure out how. Anyone with any ideas? > > It's possible to return a set of results from C functions using the > new function manager in 7.1 or later. Take a look at following email > in the archive. Well, I kind of have that already. I can return a set, but I can't use it in a join. freedb=# select ftss_search('all { pink floyd money }') ;ftss_search ------------- 120 (1 row) freedb=# select * from cdsongs where songid = ftss_results() ; ERROR: Set-valued function called in context that cannot accept a set How do you join against a set?
> Well, I kind of have that already. I can return a set, but I can't use it in a > join. > > freedb=# select ftss_search('all { pink floyd money }') ; > ftss_search > ------------- > 120 > (1 row) > > freedb=# select * from cdsongs where songid = ftss_results() ; > ERROR: Set-valued function called in context that cannot accept a set > > How do you join against a set? Well, assuming that ftss_results() returns a set of songid, you could do something like: select * from cdsongs where songid in (select ftss_results()); BTW, what's the difference between ftss_search and ftss_results? -- Tatsuo Ishii
mlw <markw@mohawksoft.com> writes: > freedb=# select * from cdsongs where songid = ftss_results() ; > ERROR: Set-valued function called in context that cannot accept a set '=' is a scalar operation. Try select * from cdsongs where songid IN (select ftss_results()); regards, tom lane
Tom Lane wrote: > > mlw <markw@mohawksoft.com> writes: > > freedb=# select * from cdsongs where songid = ftss_results() ; > > ERROR: Set-valued function called in context that cannot accept a set > > '=' is a scalar operation. Try > > select * from cdsongs where songid IN (select ftss_results()); I was afraid you'd say that. That does not use indexes. It is pointless to use a text search engine if the result has to perform a table scan anyway. If I do: create temp table fubar as select ftss_results() as songid; select * from cdsongs where songid = fubar.songid; That works, but that is slow and a lot of people have emotional difficulties with using temporary tables. (Oracle syndrome) Also, an 'IN' clause does not preserve the order of the results, where as a join should.
Tatsuo Ishii wrote: > > > Well, I kind of have that already. I can return a set, but I can't use it in a > > join. > > > > freedb=# select ftss_search('all { pink floyd money }') ; > > ftss_search > > ------------- > > 120 > > (1 row) > > > > freedb=# select * from cdsongs where songid = ftss_results() ; > > ERROR: Set-valued function called in context that cannot accept a set > > > > How do you join against a set? > > Well, assuming that ftss_results() returns a set of songid, you could > do something like: > > select * from cdsongs where songid in (select ftss_results()); That, however, does not use the songid index, thus it renders the text search engine useless. > > BTW, what's the difference between ftss_search and ftss_results? ftss_search executes the search to the external engine, and returns the number of results. ftss_results returns the set of results.
mlw <markw@mohawksoft.com> writes: > If I do: > create temp table fubar as select ftss_results() as songid; > select * from cdsongs where songid = fubar.songid; > That works, but that is slow and a lot of people have emotional difficulties > with using temporary tables. If you don't like temp tables, try select cdsongs.* from cdsongs, (select ftss_results() as ftss) as tmp where songid = tmp.ftss; which'll produce the same results. Do I need to point out that the semantics aren't the same as with IN? (Unless the output of ftss_results is guaranteed unique...) > Also, an 'IN' clause does not > preserve the order of the results, where as a join should. This statement is flat-out wrong --- don't you know that SQL makes no promises about tuple ordering? regards, tom lane
mlw wrote: > > Tom Lane wrote: > > > > mlw <markw@mohawksoft.com> writes: > > > freedb=# select * from cdsongs where songid = ftss_results() ; > > > ERROR: Set-valued function called in context that cannot accept a set > > > > '=' is a scalar operation. Try > > > > select * from cdsongs where songid IN (select ftss_results()); > > I was afraid you'd say that. That does not use indexes. > > It is pointless to use a text search engine if the result has to perform a > table scan anyway. > > If I do: > > create temp table fubar as select ftss_results() as songid; > select * from cdsongs where songid = fubar.songid; > > That works, but that is slow and a lot of people have emotional difficulties > with using temporary tables. (Oracle syndrome) Also, an 'IN' clause does not > preserve the order of the results, where as a join should. So the standard answer to "IN doesn't use indexes" is to use EXISTS instead. I'm surely being hopelessly naive here, but why won't that work in this case? Regards, Andrew. -- _____________________________________________________________________ Andrew McMillan, e-mail: Andrew@catalyst.net.nz Catalyst IT Ltd, PO Box 10-225, Level 22, 105 The Terrace, Wellington Me: +64(21)635-694, Fax: +64(4)499-5596, Office: +64(4)499-2267xtn709
mlw wrote: > > I have an external search engine system which plugs in to postgres. I use a few > C functions to interface the search daemon with the Postgres back-end. > > The best that I have been able to do is do a "select" for each result. I have a > live demo/test site: > > http://www.mohawksoft.com/search.php3, and the PHP source code is at > http://www.mohawksoft.com/ftss_example.txt. > > I would love to get the results with one select statement, but have, to date, > been unable to figure out how. Anyone with any ideas? Well, I think I got it, and I am posting so that people trying to do what I am doing, can look through the postings!! Datum ftss_search(PG_FUNCTION_ARGS) { int4 result; int state; if(!fcinfo->resultinfo) { PG_RETURN_NULL(); } state = search_state(); if(state == 0) { text * string= PG_GETARG_TEXT_P(0); int len = VARSIZE(string)-VARHDRSZ; char szString[len+1]; memcpy(szString, VARDATA(string), len); szString[len]=0; search(DEFAULT_PORT, DEFAULT_HOST, szString); } if(search_nextresult(&result)) { ReturnSetInfo *rsi = (ReturnSetInfo *)fcinfo->resultinfo; rsi->isDone = ExprMultipleResult; PG_RETURN_INT32(result); } else { ReturnSetInfo *rsi = (ReturnSetInfo *)fcinfo->resultinfo; rsi->isDone = ExprEndResult ; } PG_RETURN_NULL(); } The above is an example of how to write a function that returns multiple results. create function ftss_search (varchar)returns setof integeras '/usr/local/lib/library.so', 'ftss_search'language 'c' with(iscachable); The above in an example of how one would register this function in postgres. select table.* from table, (select fts_search('all { bla bla }') as key) as result where result.key = table.key; The above is an example of how to use this function. Thanks everyone for you help.
mlw <markw@mohawksoft.com> writes: > The above is an example of how to write a function that returns multiple > results. One suggestion: you must check not only that fcinfo->resultinfo isn't NULL, but that it points at the sort of node you're expecting. Say if (fcinfo->resultinfo == NULL || ! IsA(fcinfo->resultinfo, ReturnSetInfo)) <complain>; If you fail to do this, you can fully expect your code to coredump a version or two hence. Right now the only possibility for resultinfo is to point at a ReturnSetInfo, but that *will* change. > create function ftss_search (varchar) > returns setof integer > as '/usr/local/lib/library.so', 'ftss_search' > language 'c' with (iscachable); > The above in an example of how one would register this function in postgres. Hmm ... given that ftss refers to external files, is it a good idea to mark it cachable? I'd sort of expect that the values it returns for a particular argument could change over time. Cachable amounts to a promise that the results for a given argument will not change over time. regards, tom lane
Tom Lane wrote: > > mlw <markw@mohawksoft.com> writes: > > The above is an example of how to write a function that returns multiple > > results. > > One suggestion: you must check not only that fcinfo->resultinfo isn't > NULL, but that it points at the sort of node you're expecting. Say > > if (fcinfo->resultinfo == NULL || > ! IsA(fcinfo->resultinfo, ReturnSetInfo)) > <complain>; > OK, that makes sense. I will put that in. > If you fail to do this, you can fully expect your code to coredump > a version or two hence. Right now the only possibility for resultinfo > is to point at a ReturnSetInfo, but that *will* change. > > > create function ftss_search (varchar) > > returns setof integer > > as '/usr/local/lib/library.so', 'ftss_search' > > language 'c' with (iscachable); > > > The above in an example of how one would register this function in postgres. > > Hmm ... given that ftss refers to external files, is it a good idea to > mark it cachable? I'd sort of expect that the values it returns for > a particular argument could change over time. Cachable amounts to a > promise that the results for a given argument will not change over time. This I don't understand. What is the lifetime of a value that "iscacheable?" Not using "iscacheable" will force a table scan, but are you saying that when a result is marked "iscacheable" it lasts the life time of the postgres session? From what I've been able to tell, a function's value which has been cached seems only to last the life of a transaction. For instance: select * from table where field = fubar ('bla bla') ; When executed, fubar gets called once. On the next invocation of the same query, fubar is again called. So I don't think cacheable has any more persistence than transaction. If this isn't the case, then YIKES!
mlw <markw@mohawksoft.com> writes: >> Hmm ... given that ftss refers to external files, is it a good idea to >> mark it cachable? > This I don't understand. What is the lifetime of a value that "iscacheable?" Forever. cachable says it's OK to reduce "func(constant)" to "constant" on sight. Right now it's not really forever because we don't save query plans for very long (unless they're inside a plpgsql function) ... but if you have a function that depends on any outside data besides its arguments, you'd be ill-advised to mark it cachable. regards, tom lane
Tom Lane wrote: > > mlw <markw@mohawksoft.com> writes: > >> Hmm ... given that ftss refers to external files, is it a good idea to > >> mark it cachable? > > > This I don't understand. What is the lifetime of a value that "iscacheable?" > > Forever. cachable says it's OK to reduce "func(constant)" to "constant" > on sight. Right now it's not really forever because we don't save query > plans for very long (unless they're inside a plpgsql function) ... but > if you have a function that depends on any outside data besides its > arguments, you'd be ill-advised to mark it cachable. That's scary!!! I can sort of see why you'd want that, but can you also see why a developer would not want that? Take this query: select * from table where field = function(...); Without the "iscacheable" flag, this function will force a table scan, but although the returned value may change over time, it would not change for this particular transaction. Some functions need to be called each time they are evaluated. Some functions need to be called only once per transaction. Do you really see any need for a function's result to have a lifetime beyond its transaction? I see a real danger in preserving the value of a function across a transaction. Granted, things like "create index fubar_ndx on fubar (function(field));" depend on this behavior, but other applications will have problems. How do we get a cached value of a function that exists for a transaction, such that we can use indexes, and how do we identify the functions who's results should have a longer lifetime? Am I out in left field here? Does anyone see this as a problem? I guess there should be three states to the lifetime of a functions return value?
mlw <markw@mohawksoft.com> writes: > Am I out in left field here? Does anyone see this as a problem? I guess there > should be three states to the lifetime of a functions return value? There has been some talk of that, but nailing down exactly what the semantics ought to be still needs more thought. As far as optimizing indexscans goes, the correct intermediate concept would be something like "result is fixed within any one scan", not any one transaction. You wouldn't really want to find that begin;select * from foo where x = functhatreadsbar();update bar ...;select * from foo where x = functhatreadsbar();end; does not give you the desired results. regards, tom lane
At 01:44 PM 5/20/01 -0400, Tom Lane wrote: >As far as optimizing indexscans goes, the correct intermediate concept >would be something like "result is fixed within any one scan", not any >one transaction. You wouldn't really want to find that > > begin; > select * from foo where x = functhatreadsbar(); > update bar ...; > select * from foo where x = functhatreadsbar(); > end; > >does not give you the desired results. No, you certainly wouldn't want that. Cached for the extent of a statement might make sense. - Don Baccus, Portland OR <dhogaza@pacifier.com> Nature photos, on-line guides, Pacific Northwest Rare Bird Alert Serviceand other goodies at http://donb.photo.net.
Tom Lane wrote: > > begin; > select * from foo where x = functhatreadsbar(); > update bar ...; > select * from foo where x = functhatreadsbar(); > end; > > does not give you the desired results. But why would you be marking the function 'iscachable' if you wanted to see the change there? Cheers, Andrew. -- _____________________________________________________________________ Andrew McMillan, e-mail: Andrew@catalyst.net.nz Catalyst IT Ltd, PO Box 10-225, Level 22, 105 The Terrace, Wellington Me: +64(21)635-694, Fax: +64(4)499-5596, Office: +64(4)499-2267xtn709
Andrew McMillan wrote: > > Tom Lane wrote: > > > > begin; > > select * from foo where x = functhatreadsbar(); > > update bar ...; > > select * from foo where x = functhatreadsbar(); > > end; > > > > does not give you the desired results. > > But why would you be marking the function 'iscachable' if you wanted to see the > change there? Because if there is an index on 'x' you would want to use it instead of performing a full table scan. If table 'foo' has millions of records, and functhatreadsbar() return one value, an operation that can take milliseconds, not takes seconds with no benefit.
Tom Lane wrote: > > mlw <markw@mohawksoft.com> writes: > > Am I out in left field here? Does anyone see this as a problem? I guess there > > should be three states to the lifetime of a functions return value? > > There has been some talk of that, but nailing down exactly what the > semantics ought to be still needs more thought. > > As far as optimizing indexscans goes, the correct intermediate concept > would be something like "result is fixed within any one scan", not any > one transaction. You wouldn't really want to find that > > begin; > select * from foo where x = functhatreadsbar(); > update bar ...; > select * from foo where x = functhatreadsbar(); > end; > > does not give you the desired results. OK, what is one to do? There is an obvious need to use functions which return a single value, and which can be assumed "frozen' for the life of a query or transaction, but would absolutely break if they could never change after that. This distinction from "iscachable" is vitally important to people coding functions for Postgres. I know a lot of what I have written for postgres would break if the desired meaning of "iscachable" were to be applied.