Thread: tsearch2 on-demand dictionary loading & using functions in tsearch2
Hello,

I'd like to ask about two separate things regarding tsearch2 in PostgreSQL 8.3.

Firstly, I've noticed that a dictionary is loaded on demand separately for each session, and apparently this behavior cannot be changed in any way. If that's the case, would it be reasonable to ask for an option to load dictionaries during Postgres startup, rather than on the first use of the dictionary in each session? I am currently working with ispell dictionaries for multiple languages, each approx. 3 MB in size. Even with a lookup in a single dictionary, the first ts_lexize call takes over one second, which from the user's point of view is quite a long time.

I see several benefits to the suggested approach:

* For those who do not use persistent connections of any sort, using ispell dictionaries currently deals a severe blow to application responsiveness. Loading the dictionaries during database startup instead would speed things up significantly.

* Considering the dictionary is loaded separately for each session, does this also imply that each running backend keeps a separate copy of the dictionary in memory? If so, using e.g. 2 dictionaries of 3 MB each on a database server with 20 running backends would eat up as much as 120 MB of RAM, whereas if the server loaded the dictionaries beforehand, the OS could (possibly) keep them shared in memory.

As for downsides, I only really see two:

* Tracking updates of dictionaries - but it's reasonable to assume that new connections are opened more often than a dictionary is updated. Also, this could easily be solved by stat()-ing the dictionary file at session startup and having the server reload it only when a change is detected (see the sketch below).

* Possibly complicated to implement?

As for my second question, is it possible to use functions in tsearch2? For example, writing my own stemmer in PL/pgSQL, or in C as a Postgres function.

Thanks in advance for your reply,
Steve
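[A minimal sketch of the stat()-based check proposed above - the function name and the remembered-mtime scheme are purely illustrative, not anything tsearch2 provides:]

#include <sys/stat.h>
#include <time.h>

/* mtime recorded when the dictionary file was last (re)loaded */
static time_t loaded_mtime = 0;

/*
 * Returns 1 if the dictionary file changed since it was loaded,
 * 0 if it is unchanged, or -1 if stat() fails (file missing, no
 * permission, ...).  A backend would call this before the first use
 * in a session and reload only on a change.
 */
static int
dict_file_changed(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    return (st.st_mtime != loaded_mtime) ? 1 : 0;
}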
> * Considering the dictionary is loaded separately for each session,
> does this also imply that each running backend keeps a separate copy
> of the dictionary in memory?

Yes.

> As for downsides, I only really see two:
> * Tracking updates of dictionaries - but it's reasonable to assume
> that new connections are opened more often than a dictionary is
> updated. Also, this could easily be solved by stat()-ing the
> dictionary file at session startup and having the server reload it
> only when a change is detected.
> * Possibly complicated to implement?

Keeping the dictionary up to date is the most difficult part here. The dictionary's configuration can be changed by an ALTER command, so the parent process (and all currently running backends) would have to get that information and reload the dictionary.

> As for my second question, is it possible to use functions in
> tsearch2? For example, writing my own stemmer in PL/pgSQL, or in C as
> a Postgres function.

Yes, of course, you can develop your own dictionary (or dictionaries) and parser. But only in C, because they are critical for performance.

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                   WWW: http://www.sigaev.ru/
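[To make "only in C" concrete, here is a minimal sketch of a custom dictionary against the 8.3 API; the mydict names are hypothetical, and a real dictionary would usually also have an init function receiving its options. The lexize function receives raw pointers and must return a palloc'd array of TSLexeme terminated by an entry whose lexeme is NULL:]

#include "postgres.h"
#include "fmgr.h"
#include "tsearch/ts_public.h"
#include "tsearch/ts_locale.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(mydict_lexize);

/*
 * A trivial "stemmer" that just lowercases the token.  Hooked up with:
 *
 *   CREATE FUNCTION mydict_lexize(internal, internal, internal, internal)
 *       RETURNS internal AS 'MODULE_PATHNAME' LANGUAGE C STRICT;
 *   CREATE TEXT SEARCH TEMPLATE mydict_template (LEXIZE = mydict_lexize);
 *   CREATE TEXT SEARCH DICTIONARY mydict (TEMPLATE = mydict_template);
 */
Datum
mydict_lexize(PG_FUNCTION_ARGS)
{
    /* arg 0 is the private data built by the init function (none here) */
    char       *in = (char *) PG_GETARG_POINTER(1);
    int32       len = PG_GETARG_INT32(2);
    TSLexeme   *res;

    /* one result lexeme plus the NULL terminator */
    res = palloc0(sizeof(TSLexeme) * 2);
    res[0].lexeme = lowerstr_with_len(in, len);

    PG_RETURN_POINTER(res);
}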
Teodor Sigaev wrote:
>> As for downsides, I only really see two:
>> * Tracking updates of dictionaries - but it's reasonable to assume
>> that new connections are opened more often than a dictionary is
>> updated. Also, this could easily be solved by stat()-ing the
>> dictionary file at session startup and having the server reload it
>> only when a change is detected.
>> * Possibly complicated to implement?
>
> Keeping the dictionary up to date is the most difficult part here.
> The dictionary's configuration can be changed by an ALTER command, so
> the parent process (and all currently running backends) would have to
> get that information and reload the dictionary.

Hmm, good point; I presume "accept the fact that a settings change won't propagate to other backends until they reconnect" would not be acceptable behavior, even if documented along with the relevant configuration option?

>> As for my second question, is it possible to use functions in
>> tsearch2? For example, writing my own stemmer in PL/pgSQL, or in C as
>> a Postgres function.
>
> Yes, of course, you can develop your own dictionary (or dictionaries)
> and parser. But only in C, because they are critical for performance.

I had something different in mind. Considering there are already facilities for user-defined functions, be it in PL/pgSQL, PL/Python or C, why not just use those, with the condition that the function must accept certain arguments and return a certain result? Or would that slow things down too much even with C as the language of the actual parser?

Best regards,
Steve
> Hmm, good point; I presume "accept the fact that a settings change
> won't propagate to other backends until they reconnect" would not be
> acceptable behavior, even if documented along with the relevant
> configuration option?

I suppose so. That was one of the reasons to move tsearch into core, and it would be too regrettable to lose that feature again.

>>> As for my second question, is it possible to use functions in
>>> tsearch2? For example, writing my own stemmer in PL/pgSQL, or in C
>>> as a Postgres function.
> I had something different in mind. Considering there are already
> facilities for user-defined functions, be it in PL/pgSQL, PL/Python
> or C, why not just use those, with the condition that the function
> must accept certain arguments and return a certain result? Or would
> that slow things down too much even with C as the language of the
> actual parser?

The API to the dictionary and parser intentionally uses complex (and nested) C structures to reduce overhead. While parsing text, Postgres makes two calls to the parser per word (one call returns the word, the second its delimiter - space is a lexeme too, although it isn't subject to indexing!) and one call to the dictionary per word. So, if your language can work with C structures, then you can use it with tsearch, at a greater or smaller performance cost. PL/pgSQL doesn't have this capability.

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                   WWW: http://www.sigaev.ru/
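[For reference, the structure a dictionary hands back, as declared in 8.3's src/include/tsearch/ts_public.h (comments lightly paraphrased); a lexize call returns a palloc'd array of these, terminated by an entry with lexeme set to NULL - not something a PL/pgSQL function could build or return:]

typedef struct
{
    uint16      nvariant;   /* number of the variant of a split word */
    uint16      flags;      /* TSL_* flags */
    char       *lexeme;     /* C string; NULL terminates the array */
} TSLexeme;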
Teodor Sigaev <teodor@sigaev.ru> writes:
>> Hmm, good point; I presume "accept the fact that a settings change
>> won't propagate to other backends until they reconnect" would not be
>> acceptable behavior, even if documented along with the relevant
>> configuration option?

> I suppose so. That was one of the reasons to move tsearch into core,
> and it would be too regrettable to lose that feature again.

The whole idea is pretty much a nonstarter anyway, because the only place to "preload" dictionaries would be in the postmaster process. That will never fly, for several reasons:

1. The postmaster doesn't have database access and so couldn't read the dictionary definition.

2. If it did have database access, it wouldn't know which database of the cluster to look in.

3. The point of the postmaster process is to be so simple as to be nearly crashproof. We would never accept a patch that made it execute dictionary-loading code, because of the likely impact on system reliability. (This is in fact the reason behind point #1.)

4. The approach would never work anyway on Windows, since that platform does fork/exec, not just fork.

What I think *is* worth doing is spending some time on making dictionary loading go faster. Has any focused profiling and optimization effort been spent on that code at all? One idea is to precompile the text-based config files into some format designed to load fast. (But it'd be premature to adopt that idea without some profiling results to prove where the bottleneck is.)

			regards, tom lane
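[A sketch of the precompilation idea - everything here is hypothetical, as tsearch2 has no such cache format: the parsed word list is flattened into a position-independent blob in which all strings live in one arena and are addressed by offsets, so a later load is a single read of the file instead of a re-parse of the .dict and .affix sources:]

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define DICT_CACHE_MAGIC 0x54534431     /* "TSD1", made up */

typedef struct
{
    uint32_t    magic;
    uint32_t    nwords;
    uint32_t    arena_size;     /* total bytes of string data */
} CacheHeader;

/* Layout: header, then nwords uint32 string offsets, then the arena. */
static int
compile_cache(const char *path, char **words, uint32_t nwords)
{
    FILE       *f = fopen(path, "wb");
    CacheHeader h = {DICT_CACHE_MAGIC, nwords, 0};
    uint32_t    i, off = 0;

    if (f == NULL)
        return -1;
    for (i = 0; i < nwords; i++)
        h.arena_size += (uint32_t) strlen(words[i]) + 1;
    fwrite(&h, sizeof h, 1, f);
    for (i = 0; i < nwords; i++)        /* offset table */
    {
        fwrite(&off, sizeof off, 1, f);
        off += (uint32_t) strlen(words[i]) + 1;
    }
    for (i = 0; i < nwords; i++)        /* string arena */
        fwrite(words[i], strlen(words[i]) + 1, 1, f);
    return fclose(f);
}

[Because the blob contains no pointers, it can be loaded anywhere - or mapped, as discussed below - without fix-ups; word i is found at arena_start + offsets[i].]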
Tom Lane wrote:
> What I think *is* worth doing is spending some time on making
> dictionary loading go faster.

This is probably a stupid question, but: with PostgreSQL's use of shared memory, is it possible to load dictionaries into a small reserved shm area when the first backend starts, then use the preloaded copy in subsequent backends? That way the postmaster doesn't have to do any risky work.

Anything that reduces backend startup costs and per-backend unshared memory would have to be a good thing.

I've found it useful in the past to share resources with an mmap()ed file, too, especially if I want write protection from some or all processes. If the postmaster forked a process to generate the mmap()able compiled dictionary files on startup, then it'd be pretty safe from any misbehaviour of the dictionary-compiling process.

Then again, I can't say I've personally noticed the cost of loading tsearch2 dictionaries.

--
Craig Ringer
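[A sketch of the mmap() variant, assuming a position-independent cache file like the one above: each backend maps it read-only, and the kernel keeps a single physical copy of the pages no matter how many backends map it:]

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Map the compiled dictionary cache read-only.  Returns the mapping,
 * which stays valid for the life of the process, or NULL on failure.
 */
static const void *
map_dict_cache(const char *path, size_t *size_out)
{
    int         fd = open(path, O_RDONLY);
    struct stat st;
    void       *p;

    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) != 0)
    {
        close(fd);
        return NULL;
    }
    p = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping survives the close */
    if (p == MAP_FAILED)
        return NULL;
    *size_out = (size_t) st.st_size;
    return p;
}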
Craig Ringer wrote:
> This is probably a stupid question, but: with PostgreSQL's use of
> shared memory, is it possible to load dictionaries into a small
> reserved shm area when the first backend starts, then use the
> preloaded copy in subsequent backends?
>
> That way the postmaster doesn't have to do any risky work.
>
> Anything that reduces backend startup costs and per-backend unshared
> memory would have to be a good thing.
>
> I've found it useful in the past to share resources with an mmap()ed
> file, too, especially if I want write protection from some or all
> processes. If the postmaster forked a process to generate the
> mmap()able compiled dictionary files on startup, then it'd be pretty
> safe from any misbehaviour of the dictionary-compiling process.
>
> Then again, I can't say I've personally noticed the cost of loading
> tsearch2 dictionaries.

So the dictionary would be parsed on first use by a given backend, and from that moment on, all running backends - and all backends spawned afterwards - would have access to the parsed dictionary structures through the shm? That seems to solve all the issues: speed, memory and updating.

Would this be a way to go? Obviously, it might boil down to "write a patch", but if someone actually wrote one, would this approach be acceptable?

Thanks,
Steve

PS: Please CC me, as I am off the list.
Hello,

We definitely came across this issue recently. When a new Postgres backend is started, it uses ~3 MB of memory according to pmap. When one runs several typical queries that our application generates within this backend, its consumed memory grows to 5-8 MB, which is not critical for us. But when one hits some FTS function with a token that requires the ispell dictionaries to be loaded, we instantly get 26 MB of consumed memory in this backend.

Having 50 backends behind pgbouncer, all of them holding ~20 MB of redundant FTS data, is a serious penalty on some hardware: during peak load the kernel invalidates huge parts of its disk cache - containing the actually 'hot' data - to allocate more RAM for the Postgres backends, and we get huge iowait as a result. We definitely observe this scenario on one of our servers now, and the ability to save that much RAM by putting some FTS data in shared memory would help here. We alter dictionaries once per couple of months and would tolerate even a Postgres restart after such a change.

--
Regards,
Ivan Zolotukhin

>> This is probably a stupid question, but: with PostgreSQL's use of
>> shared memory, is it possible to load dictionaries into a small
>> reserved shm area when the first backend starts, then use the
>> preloaded copy in subsequent backends?
>>
>> That way the postmaster doesn't have to do any risky work.
>>
>> Anything that reduces backend startup costs and per-backend unshared
>> memory would have to be a good thing.
>>
>> I've found it useful in the past to share resources with an mmap()ed
>> file, too, especially if I want write protection from some or all
>> processes. If the postmaster forked a process to generate the
>> mmap()able compiled dictionary files on startup, then it'd be pretty
>> safe from any misbehaviour of the dictionary-compiling process.
>>
>> Then again, I can't say I've personally noticed the cost of loading
>> tsearch2 dictionaries.
>
> So the dictionary would be parsed on first use by a given backend, and
> from that moment on, all running backends - and all backends spawned
> afterwards - would have access to the parsed dictionary structures
> through the shm?
>
> That seems to solve all the issues: speed, memory and updating. Would
> this be a way to go? Obviously, it might boil down to "write a patch",
> but if someone actually wrote one, would this approach be acceptable?