Thread: External search engine, advice

External search engine, advice

From

mlw

Date:

18 May 2001, 19:22:47

I have an external search engine system which plugs in to postgres. I use a few
C functions to interface the search daemon with the Postgres back-end.

The best that I have been able to do is do a "select" for each result. I have a
live demo/test site:

http://www.mohawksoft.com/search.php3, and the PHP source code is at
http://www.mohawksoft.com/ftss_example.txt.

I would love to get the results with one select statement, but have, to date,
been unable to figure out how. Anyone with any ideas?

Re: External search engine, advice

From

Tatsuo Ishii

Date:

18 May 2001, 23:06:38

> I have an external search engine system which plugs in to postgres. I use a few
> C functions to interface the search daemon with the Postgres back-end.
> 
> The best that I have been able to do is do a "select" for each result. I have a
> live demo/test site:
> 
> http://www.mohawksoft.com/search.php3, and the PHP source code is at
> http://www.mohawksoft.com/ftss_example.txt.
> 
> I would love to get the results with one select statement, but have, to date,
> been unable to figure out how. Anyone with any ideas?

It's possible to return a set of results from C functions using the
new function manager in 7.1 or later. Take a look at following email
in the archive.

Subject: Re: [INTERFACES] Re: can external C-function get multiple rows? 
From: Tom Lane <tgl@sss.pgh.pa.us>
To: alexey@price.ru
cc: pgsql-interfaces@postgresql.org
Date: Mon, 30 Apr 2001 01:52:57 -0400

Actually I have created such a function calling an external full text
search engine called "namazu". Here is an example to search a keyword
"int8" from index files pre-generated by namazu.

test=# select pgnmzsrch('int8','/home/t-ishii/lib/namazu/hackers');               ?column?                

----------------------------------------/home/t-ishii/lib/namazu/hackers/21000/home/t-ishii/lib/namazu/hackers/21001/home/t-ishii/lib/namazu/hackers/21003/home/t-ishii/lib/namazu/hackers/21004/home/t-ishii/lib/namazu/hackers/21002/home/t-ishii/lib/namazu/hackers/21005/home/t-ishii/lib/namazu/hackers/21006
(7 rows)
--
Tatsuo Ishii

Re: External search engine, advice

From

mlw

Date:

19 May 2001, 10:41:38

Tatsuo Ishii wrote:
> 
> > I have an external search engine system which plugs in to postgres. I use a few
> > C functions to interface the search daemon with the Postgres back-end.
> >
> > The best that I have been able to do is do a "select" for each result. I have a
> > live demo/test site:
> >
> > http://www.mohawksoft.com/search.php3, and the PHP source code is at
> > http://www.mohawksoft.com/ftss_example.txt.
> >
> > I would love to get the results with one select statement, but have, to date,
> > been unable to figure out how. Anyone with any ideas?
> 
> It's possible to return a set of results from C functions using the
> new function manager in 7.1 or later. Take a look at following email
> in the archive.

Well, I kind of have that already. I can return a set, but I can't use it in a
join.

freedb=# select ftss_search('all { pink floyd money }') ;ftss_search
-------------        120
(1 row)
freedb=# select * from cdsongs where songid = ftss_results() ;
ERROR:  Set-valued function called in context that cannot accept a set

How do you join against a set?

Re: Re: External search engine, advice

From

Tatsuo Ishii

Date:

19 May 2001, 11:14:01

> Well, I kind of have that already. I can return a set, but I can't use it in a
> join.
> 
> freedb=# select ftss_search('all { pink floyd money }') ;
>  ftss_search
> -------------
>          120
> (1 row)
>  
> freedb=# select * from cdsongs where songid = ftss_results() ;
> ERROR:  Set-valued function called in context that cannot accept a set
> 
> How do you join against a set?

Well, assuming that ftss_results() returns a set of songid, you could
do something like:

select * from cdsongs where songid in (select ftss_results());

BTW, what's the difference between ftss_search and ftss_results?
--
Tatsuo Ishii

Re: Re: External search engine, advice

From

Tom Lane

Date:

19 May 2001, 11:33:52

mlw <markw@mohawksoft.com> writes:
> freedb=# select * from cdsongs where songid = ftss_results() ;
> ERROR:  Set-valued function called in context that cannot accept a set

'=' is a scalar operation.  Try

select * from cdsongs where songid IN (select ftss_results());
        regards, tom lane

Re: External search engine, advice

From

mlw

Date:

19 May 2001, 12:15:14

Tom Lane wrote:
> 
> mlw <markw@mohawksoft.com> writes:
> > freedb=# select * from cdsongs where songid = ftss_results() ;
> > ERROR:  Set-valued function called in context that cannot accept a set
> 
> '=' is a scalar operation.  Try
> 
> select * from cdsongs where songid IN (select ftss_results());

I was afraid you'd say that. That does not use indexes.

It is pointless to use a text search engine if the result has to perform a
table scan anyway.

If I do:

create temp table fubar as select ftss_results() as songid;
select * from cdsongs where songid = fubar.songid;

That works, but that is slow and a lot of people have emotional difficulties
with using temporary tables. (Oracle syndrome) Also, an 'IN' clause does not
preserve the order of the results, where as a join should.

Re: External search engine, advice

From

mlw

Date:

19 May 2001, 12:18:03

Tatsuo Ishii wrote:
> 
> > Well, I kind of have that already. I can return a set, but I can't use it in a
> > join.
> >
> > freedb=# select ftss_search('all { pink floyd money }') ;
> >  ftss_search
> > -------------
> >          120
> > (1 row)
> >
> > freedb=# select * from cdsongs where songid = ftss_results() ;
> > ERROR:  Set-valued function called in context that cannot accept a set
> >
> > How do you join against a set?
> 
> Well, assuming that ftss_results() returns a set of songid, you could
> do something like:
> 
> select * from cdsongs where songid in (select ftss_results());

That, however, does not use the songid index, thus it renders the text search
engine useless.

> 
> BTW, what's the difference between ftss_search and ftss_results?

ftss_search executes the search to the external engine, and returns the number
of results. ftss_results returns the set of results.

Re: External search engine, advice

From

Tom Lane

Date:

19 May 2001, 20:20:37

mlw <markw@mohawksoft.com> writes:
> If I do:
> create temp table fubar as select ftss_results() as songid;
> select * from cdsongs where songid = fubar.songid;
> That works, but that is slow and a lot of people have emotional difficulties
> with using temporary tables.

If you don't like temp tables, try

select cdsongs.* from cdsongs, (select ftss_results() as ftss) as tmp
where songid = tmp.ftss;

which'll produce the same results.

Do I need to point out that the semantics aren't the same as with IN?
(Unless the output of ftss_results is guaranteed unique...)

> Also, an 'IN' clause does not
> preserve the order of the results, where as a join should.

This statement is flat-out wrong --- don't you know that SQL makes no
promises about tuple ordering?
        regards, tom lane

Re: Re: External search engine, advice

From

Andrew McMillan

Date:

19 May 2001, 21:34:30

mlw wrote:
> 
> Tom Lane wrote:
> >
> > mlw <markw@mohawksoft.com> writes:
> > > freedb=# select * from cdsongs where songid = ftss_results() ;
> > > ERROR:  Set-valued function called in context that cannot accept a set
> >
> > '=' is a scalar operation.  Try
> >
> > select * from cdsongs where songid IN (select ftss_results());
> 
> I was afraid you'd say that. That does not use indexes.
> 
> It is pointless to use a text search engine if the result has to perform a
> table scan anyway.
> 
> If I do:
> 
> create temp table fubar as select ftss_results() as songid;
> select * from cdsongs where songid = fubar.songid;
> 
> That works, but that is slow and a lot of people have emotional difficulties
> with using temporary tables. (Oracle syndrome) Also, an 'IN' clause does not
> preserve the order of the results, where as a join should.

So the standard answer to "IN doesn't use indexes" is to use EXISTS instead.  I'm
surely being hopelessly naive here, but why won't that work in this case?

Regards,                    Andrew.
-- 
_____________________________________________________________________          Andrew McMillan, e-mail:
Andrew@catalyst.net.nz
Catalyst IT Ltd, PO Box 10-225, Level 22, 105 The Terrace, Wellington
Me: +64(21)635-694, Fax: +64(4)499-5596, Office: +64(4)499-2267xtn709

Re: External search engine, advice

From

mlw

Date:

19 May 2001, 22:35:25

mlw wrote:
> 
> I have an external search engine system which plugs in to postgres. I use a few
> C functions to interface the search daemon with the Postgres back-end.
> 
> The best that I have been able to do is do a "select" for each result. I have a
> live demo/test site:
> 
> http://www.mohawksoft.com/search.php3, and the PHP source code is at
> http://www.mohawksoft.com/ftss_example.txt.
> 
> I would love to get the results with one select statement, but have, to date,
> been unable to figure out how. Anyone with any ideas?

Well, I think I got it, and I am posting so that people trying to do what I am
doing, can look through the postings!!
 Datum ftss_search(PG_FUNCTION_ARGS) {           int4 result;           int state;
if(!fcinfo->resultinfo)          {                   PG_RETURN_NULL();           }           state = search_state();
      if(state == 0)           {                   text * string= PG_GETARG_TEXT_P(0);                   int len =
VARSIZE(string)-VARHDRSZ;                  char szString[len+1];                    memcpy(szString, VARDATA(string),
len);                  szString[len]=0;                   search(DEFAULT_PORT, DEFAULT_HOST, szString);           }
     if(search_nextresult(&result))           {                   ReturnSetInfo *rsi = (ReturnSetInfo
*)fcinfo->resultinfo;                  rsi->isDone = ExprMultipleResult;                   PG_RETURN_INT32(result);
     }           else           {                   ReturnSetInfo *rsi = (ReturnSetInfo *)fcinfo->resultinfo;
       rsi->isDone = ExprEndResult ;           }           PG_RETURN_NULL(); }
 

The above is an example of how to write a function that returns multiple
results.

create function ftss_search (varchar)returns setof integeras '/usr/local/lib/library.so', 'ftss_search'language 'c'
with(iscachable);
 

The above in an example of how one would register this function in postgres.

select table.* from table, (select fts_search('all { bla bla }') as key) as
result where result.key = table.key;

The above is an example of how to use this function.

Thanks everyone for you help.

Re: Re: External search engine, advice

From

Tom Lane

Date:

20 May 2001, 00:28:23

mlw <markw@mohawksoft.com> writes:
> The above is an example of how to write a function that returns multiple
> results.

One suggestion: you must check not only that fcinfo->resultinfo isn't
NULL, but that it points at the sort of node you're expecting.  Say
if (fcinfo->resultinfo == NULL ||    ! IsA(fcinfo->resultinfo, ReturnSetInfo))    <complain>;

If you fail to do this, you can fully expect your code to coredump
a version or two hence.  Right now the only possibility for resultinfo
is to point at a ReturnSetInfo, but that *will* change.

> create function ftss_search (varchar)
>     returns setof integer
>     as '/usr/local/lib/library.so', 'ftss_search'
>     language 'c' with (iscachable);

> The above in an example of how one would register this function in postgres.

Hmm ... given that ftss refers to external files, is it a good idea to
mark it cachable?  I'd sort of expect that the values it returns for
a particular argument could change over time.  Cachable amounts to a
promise that the results for a given argument will not change over time.
        regards, tom lane

Re: External search engine, advice

From

mlw

Date:

20 May 2001, 08:09:05

Tom Lane wrote:
> 
> mlw <markw@mohawksoft.com> writes:
> > The above is an example of how to write a function that returns multiple
> > results.
> 
> One suggestion: you must check not only that fcinfo->resultinfo isn't
> NULL, but that it points at the sort of node you're expecting.  Say
> 
>         if (fcinfo->resultinfo == NULL ||
>             ! IsA(fcinfo->resultinfo, ReturnSetInfo))
>                 <complain>;
> 

OK, that makes sense. I will put that in.

> If you fail to do this, you can fully expect your code to coredump
> a version or two hence.  Right now the only possibility for resultinfo
> is to point at a ReturnSetInfo, but that *will* change.
> 
> > create function ftss_search (varchar)
> >       returns setof integer
> >       as '/usr/local/lib/library.so', 'ftss_search'
> >       language 'c' with (iscachable);
> 
> > The above in an example of how one would register this function in postgres.
> 
> Hmm ... given that ftss refers to external files, is it a good idea to
> mark it cachable?  I'd sort of expect that the values it returns for
> a particular argument could change over time.  Cachable amounts to a
> promise that the results for a given argument will not change over time.

This I don't understand. What is the lifetime of a value that "iscacheable?"
Not using "iscacheable" will force a table scan, but are you saying that when a
result is marked "iscacheable" it lasts the life time of the postgres session? 

From what I've been able to tell, a function's value which has been cached
seems only to last the life of a transaction. For instance:

select * from table where field = fubar ('bla bla') ;

When executed, fubar gets called once. On the next invocation of the same
query, fubar is again called. So I don't think cacheable has any more
persistence than transaction. If this isn't the case, then YIKES!

Re: External search engine, advice

From

Tom Lane

Date:

20 May 2001, 12:40:00

mlw <markw@mohawksoft.com> writes:
>> Hmm ... given that ftss refers to external files, is it a good idea to
>> mark it cachable?

> This I don't understand. What is the lifetime of a value that "iscacheable?"

Forever.  cachable says it's OK to reduce "func(constant)" to "constant"
on sight.  Right now it's not really forever because we don't save query
plans for very long (unless they're inside a plpgsql function) ... but
if you have a function that depends on any outside data besides its
arguments, you'd be ill-advised to mark it cachable.
        regards, tom lane

Re: External search engine, advice

From

mlw

Date:

20 May 2001, 13:32:06

Tom Lane wrote:
> 
> mlw <markw@mohawksoft.com> writes:
> >> Hmm ... given that ftss refers to external files, is it a good idea to
> >> mark it cachable?
> 
> > This I don't understand. What is the lifetime of a value that "iscacheable?"
> 
> Forever.  cachable says it's OK to reduce "func(constant)" to "constant"
> on sight.  Right now it's not really forever because we don't save query
> plans for very long (unless they're inside a plpgsql function) ... but
> if you have a function that depends on any outside data besides its
> arguments, you'd be ill-advised to mark it cachable.

That's scary!!!

I can sort of see why you'd want that, but can you also see why a developer
would not want that?

Take this query:

select * from table where field = function(...);

Without the "iscacheable" flag, this function will force a table scan, but
although the returned value may change over time, it would not change for this
particular transaction.

Some functions need to be called each time they are evaluated.
Some functions need to be called only once per transaction.

Do you really see any need for a function's result to have a lifetime beyond
its transaction? I see a real danger in preserving the value of a function
across a transaction. Granted, things like "create index fubar_ndx on fubar
(function(field));" depend on this behavior, but other applications will have
problems.

How do we get a cached value of a function that exists for a transaction, such
that we can use indexes, and how do we identify the functions who's results
should have a longer lifetime?

Am I out in left field here? Does anyone see this as a problem? I guess there
should be three states to the lifetime of a functions return value?

Re: External search engine, advice

From

Tom Lane

Date:

20 May 2001, 13:44:23

mlw <markw@mohawksoft.com> writes:
> Am I out in left field here? Does anyone see this as a problem? I guess there
> should be three states to the lifetime of a functions return value?

There has been some talk of that, but nailing down exactly what the
semantics ought to be still needs more thought.

As far as optimizing indexscans goes, the correct intermediate concept
would be something like "result is fixed within any one scan", not any
one transaction.  You wouldn't really want to find that
begin;select * from foo where x = functhatreadsbar();update bar ...;select * from foo where x =
functhatreadsbar();end;

does not give you the desired results.
        regards, tom lane

Re: Re: External search engine, advice

From

Don Baccus

Date:

20 May 2001, 14:29:41

At 01:44 PM 5/20/01 -0400, Tom Lane wrote:

>As far as optimizing indexscans goes, the correct intermediate concept
>would be something like "result is fixed within any one scan", not any
>one transaction.  You wouldn't really want to find that
>
>    begin;
>    select * from foo where x = functhatreadsbar();
>    update bar ...;
>    select * from foo where x = functhatreadsbar();
>    end;
>
>does not give you the desired results.

No, you certainly wouldn't want that.  Cached for the extent of a statement
might make sense.



- Don Baccus, Portland OR <dhogaza@pacifier.com> Nature photos, on-line guides, Pacific Northwest Rare Bird Alert
Serviceand other goodies at http://donb.photo.net.

Re: Re: External search engine, advice

From

Andrew McMillan

Date:

20 May 2001, 15:48:54

Tom Lane wrote:
> 
>         begin;
>         select * from foo where x = functhatreadsbar();
>         update bar ...;
>         select * from foo where x = functhatreadsbar();
>         end;
> 
> does not give you the desired results.

But why would you be marking the function 'iscachable' if you wanted to see the
change there?

Cheers,                Andrew.
-- 
_____________________________________________________________________          Andrew McMillan, e-mail:
Andrew@catalyst.net.nz
Catalyst IT Ltd, PO Box 10-225, Level 22, 105 The Terrace, Wellington
Me: +64(21)635-694, Fax: +64(4)499-5596, Office: +64(4)499-2267xtn709

Re: External search engine, advice

From

mlw

Date:

20 May 2001, 17:00:20

Andrew McMillan wrote:
> 
> Tom Lane wrote:
> >
> >         begin;
> >         select * from foo where x = functhatreadsbar();
> >         update bar ...;
> >         select * from foo where x = functhatreadsbar();
> >         end;
> >
> > does not give you the desired results.
> 
> But why would you be marking the function 'iscachable' if you wanted to see the
> change there?
Because if there is an index on 'x' you would want to use it instead of
performing a full table scan. If table 'foo' has millions of records, and
functhatreadsbar() return one value, an operation that can take milliseconds,
not takes seconds with no benefit.

Re: External search engine, advice

From

mlw

Date:

20 May 2001, 17:17:57

Tom Lane wrote:
> 
> mlw <markw@mohawksoft.com> writes:
> > Am I out in left field here? Does anyone see this as a problem? I guess there
> > should be three states to the lifetime of a functions return value?
> 
> There has been some talk of that, but nailing down exactly what the
> semantics ought to be still needs more thought.
> 
> As far as optimizing indexscans goes, the correct intermediate concept
> would be something like "result is fixed within any one scan", not any
> one transaction.  You wouldn't really want to find that
> 
>         begin;
>         select * from foo where x = functhatreadsbar();
>         update bar ...;
>         select * from foo where x = functhatreadsbar();
>         end;
> 
> does not give you the desired results.

OK, what is one to do?

There is an obvious need to use functions which return a single value, and
which can be assumed "frozen' for the life of a query or transaction, but would
absolutely break if they could never change after that. This distinction from
"iscachable" is vitally important to people coding functions for Postgres. I
know a lot of what I have written for postgres would break if the desired
meaning of "iscachable" were to be applied.