Thread: tsearch2 on-demand dictionary loading & using functions in tsearch2
Hello,

I'd like to ask about two separate things regarding tsearch2 in PostgreSQL 8.3.

Firstly, I've noticed that a dictionary is loaded on demand separately for each session, and apparently this behavior cannot be changed in any way. If that's the case, would it be reasonable to ask for an option to load dictionaries during Postgres startup, rather than on the first use of the dictionary in each session? I am currently working with ispell dictionaries for multiple languages, each approx. 3 MB in size. Even with a lookup in a single dictionary, the first ts_lexize call takes over one second, which from the user's point of view is quite a long time.

I see several benefits to the suggested approach:

* For those who do not use persistent connections of any sort, using ispell dictionaries currently deals a severe blow to application responsiveness. Loading the dictionaries during database startup instead would speed things up significantly.

* Considering the dictionary is loaded separately for each session, does this also imply that each running backend keeps a separate copy of the dictionary in memory? If so, using e.g. 2 dictionaries of 3 MB each on a database server with 20 running backends would eat up as much as 120 MB of RAM, whereas if the server loaded the dictionaries beforehand, the OS could (possibly) keep them shared in memory.

As for downsides, I only really see two:

* Tracking updates of dictionaries - but it's reasonable to assume that new connections are opened more often than a dictionary is updated. Also, this could easily be solved by stat()-ing the dictionary file at session startup and having the server reload it only when a change is detected (see the sketch below).

* Possibly complicated to implement?

As for my second question, is it possible to use functions in tsearch2? For example, writing my own stemmer in PL/pgSQL, or in C as a Postgres function.

Thanks in advance for your reply,
Steve
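[A minimal sketch of the stat()-based check proposed above - the function name and the remembered-mtime scheme are purely illustrative, not anything tsearch2 provides:]

#include <sys/stat.h>
#include <time.h>

/* mtime recorded when the dictionary file was last (re)loaded */
static time_t loaded_mtime = 0;

/*
 * Returns 1 if the dictionary file changed since it was loaded,
 * 0 if it is unchanged, or -1 if stat() fails (file missing, no
 * permission, ...).  A backend would call this before the first use
 * in a session and reload only on a change.
 */
static int
dict_file_changed(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    return (st.st_mtime != loaded_mtime) ? 1 : 0;
}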
> * Considering the dictionary is loaded separately for each session,
> does this also imply that each running backend keeps a separate copy
> of the dictionary in memory?

Yes.

> As for downsides, I only really see two:
> * Tracking updates of dictionaries - but it's reasonable to assume
> that new connections are opened more often than a dictionary is
> updated. Also, this could easily be solved by stat()-ing the
> dictionary file at session startup and having the server reload it
> only when a change is detected.
> * Possibly complicated to implement?

Keeping the dictionary up to date is the most difficult part here. The dictionary's configuration can be changed by an ALTER command, so the parent process (and all currently running backends) would have to get that information and reload the dictionary.

> As for my second question, is it possible to use functions in
> tsearch2? For example, writing my own stemmer in PL/pgSQL, or in C as
> a Postgres function.

Yes, of course, you can develop your own dictionary (or dictionaries) and parser. But only in C, because they are critical for performance.

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                   WWW: http://www.sigaev.ru/
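[To make "only in C" concrete, here is a minimal sketch of a custom dictionary against the 8.3 API; the mydict names are hypothetical, and a real dictionary would usually also have an init function receiving its options. The lexize function receives raw pointers and must return a palloc'd array of TSLexeme terminated by an entry whose lexeme is NULL:]

#include "postgres.h"
#include "fmgr.h"
#include "tsearch/ts_public.h"
#include "tsearch/ts_locale.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(mydict_lexize);

/*
 * A trivial "stemmer" that just lowercases the token.  Hooked up with:
 *
 *   CREATE FUNCTION mydict_lexize(internal, internal, internal, internal)
 *       RETURNS internal AS 'MODULE_PATHNAME' LANGUAGE C STRICT;
 *   CREATE TEXT SEARCH TEMPLATE mydict_template (LEXIZE = mydict_lexize);
 *   CREATE TEXT SEARCH DICTIONARY mydict (TEMPLATE = mydict_template);
 */
Datum
mydict_lexize(PG_FUNCTION_ARGS)
{
    /* arg 0 is the private data built by the init function (none here) */
    char       *in = (char *) PG_GETARG_POINTER(1);
    int32       len = PG_GETARG_INT32(2);
    TSLexeme   *res;

    /* one result lexeme plus the NULL terminator */
    res = palloc0(sizeof(TSLexeme) * 2);
    res[0].lexeme = lowerstr_with_len(in, len);

    PG_RETURN_POINTER(res);
}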
Teodor Sigaev wrote:
>> As for downsides, I only really see two:
>> * Tracking updates of dictionaries - but it's reasonable to assume
>> that new connections are opened more often than a dictionary is
>> updated. Also, this could easily be solved by stat()-ing the
>> dictionary file at session startup and having the server reload it
>> only when a change is detected.
>> * Possibly complicated to implement?
>
> Keeping the dictionary up to date is the most difficult part here.
> The dictionary's configuration can be changed by an ALTER command, so
> the parent process (and all currently running backends) would have to
> get that information and reload the dictionary.

Hmm, good point; I presume "accept the fact that a settings change won't propagate to other backends until they reconnect" would not be acceptable behavior, even if documented along with the relevant configuration option?

>> As for my second question, is it possible to use functions in
>> tsearch2? For example, writing my own stemmer in PL/pgSQL, or in C as
>> a Postgres function.
>
> Yes, of course, you can develop your own dictionary (or dictionaries)
> and parser. But only in C, because they are critical for performance.

I had something different in mind. Considering there are already facilities for user-defined functions, be it in PL/pgSQL, PL/Python or C, why not just use those, with the condition that the function must accept certain arguments and return a certain result? Or would that slow things down too much even with C as the language of the actual parser?

Best regards,
Steve
> Hmm, good point; I presume "accept the fact that a settings change
> won't propagate to other backends until they reconnect" would not be
> acceptable behavior, even if documented along with the relevant
> configuration option?

I suppose so. That was one of the reasons to move tsearch into core, and it would be too regrettable to lose that feature again.

>>> As for my second question, is it possible to use functions in
>>> tsearch2? For example, writing my own stemmer in PL/pgSQL, or in C
>>> as a Postgres function.
> I had something different in mind. Considering there are already
> facilities for user-defined functions, be it in PL/pgSQL, PL/Python
> or C, why not just use those, with the condition that the function
> must accept certain arguments and return a certain result? Or would
> that slow things down too much even with C as the language of the
> actual parser?

The API to the dictionary and parser intentionally uses complex (and nested) C structures to reduce overhead. While parsing text, Postgres makes two calls to the parser per word (one call returns the word, the second its delimiter - space is a lexeme too, although it isn't subject to indexing!) and one call to the dictionary per word. So, if your language can work with C structures, then you can use it with tsearch, at a greater or smaller performance cost. PL/pgSQL doesn't have this capability.

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                   WWW: http://www.sigaev.ru/
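[For reference, the structure a dictionary hands back, as declared in 8.3's src/include/tsearch/ts_public.h (comments lightly paraphrased); a lexize call returns a palloc'd array of these, terminated by an entry with lexeme set to NULL - not something a PL/pgSQL function could build or return:]

typedef struct
{
    uint16      nvariant;   /* number of the variant of a split word */
    uint16      flags;      /* TSL_* flags */
    char       *lexeme;     /* C string; NULL terminates the array */
} TSLexeme;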
Teodor Sigaev <teodor@sigaev.ru> writes:
>> Hmm, good point; I presume "accept the fact that a settings change
>> won't propagate to other backends until they reconnect" would not be
>> acceptable behavior, even if documented along with the relevant
>> configuration option?

> I suppose so. That was one of the reasons to move tsearch into core,
> and it would be too regrettable to lose that feature again.

The whole idea is pretty much a nonstarter anyway, because the only place to "preload" dictionaries would be in the postmaster process. That will never fly, for several reasons:

1. The postmaster doesn't have database access and so couldn't read the dictionary definition.

2. If it did have database access, it wouldn't know which database of the cluster to look in.

3. The point of the postmaster process is to be so simple as to be nearly crashproof. We would never accept a patch that made it execute dictionary-loading code, because of the likely impact on system reliability. (This is in fact the reason behind point #1.)

4. The approach would never work anyway on Windows, since that platform does fork/exec, not just fork.

What I think *is* worth doing is spending some time on making dictionary loading go faster. Has any focused profiling and optimization effort been spent on that code at all? One idea is to precompile the text-based config files into some format designed to load fast. (But it'd be premature to adopt that idea without some profiling results to prove where the bottleneck is.)

			regards, tom lane
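[A sketch of the precompilation idea - everything here is hypothetical, as tsearch2 has no such cache format: the parsed word list is flattened into a position-independent blob in which all strings live in one arena and are addressed by offsets, so a later load is a single read of the file instead of a re-parse of the .dict and .affix sources:]

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define DICT_CACHE_MAGIC 0x54534431     /* "TSD1", made up */

typedef struct
{
    uint32_t    magic;
    uint32_t    nwords;
    uint32_t    arena_size;     /* total bytes of string data */
} CacheHeader;

/* Layout: header, then nwords uint32 string offsets, then the arena. */
static int
compile_cache(const char *path, char **words, uint32_t nwords)
{
    FILE       *f = fopen(path, "wb");
    CacheHeader h = {DICT_CACHE_MAGIC, nwords, 0};
    uint32_t    i, off = 0;

    if (f == NULL)
        return -1;
    for (i = 0; i < nwords; i++)
        h.arena_size += (uint32_t) strlen(words[i]) + 1;
    fwrite(&h, sizeof h, 1, f);
    for (i = 0; i < nwords; i++)        /* offset table */
    {
        fwrite(&off, sizeof off, 1, f);
        off += (uint32_t) strlen(words[i]) + 1;
    }
    for (i = 0; i < nwords; i++)        /* string arena */
        fwrite(words[i], strlen(words[i]) + 1, 1, f);
    return fclose(f);
}

[Because the blob contains no pointers, it can be loaded anywhere - or mapped, as discussed below - without fix-ups; word i is found at arena_start + offsets[i].]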
Tom Lane wrote:
> What I think *is* worth doing is spending some time on making
> dictionary loading go faster.

This is probably a stupid question, but: with PostgreSQL's use of shared memory, is it possible to load dictionaries into a small reserved shm area when the first backend starts, then use the preloaded copy in subsequent backends? That way the postmaster doesn't have to do any risky work.

Anything that reduces backend startup costs and per-backend unshared memory would have to be a good thing.

I've found it useful in the past to share resources with an mmap()ed file, too, especially if I want write protection from some or all processes. If the postmaster forked a process to generate the mmap()able compiled dictionary files on startup, then it'd be pretty safe from any misbehaviour of the dictionary-compiling process.

Then again, I can't say I've personally noticed the cost of loading tsearch2 dictionaries.

--
Craig Ringer
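[A sketch of the mmap() variant, assuming a position-independent cache file like the one above: each backend maps it read-only, and the kernel keeps a single physical copy of the pages no matter how many backends map it:]

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Map the compiled dictionary cache read-only.  Returns the mapping,
 * which stays valid for the life of the process, or NULL on failure.
 */
static const void *
map_dict_cache(const char *path, size_t *size_out)
{
    int         fd = open(path, O_RDONLY);
    struct stat st;
    void       *p;

    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) != 0)
    {
        close(fd);
        return NULL;
    }
    p = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping survives the close */
    if (p == MAP_FAILED)
        return NULL;
    *size_out = (size_t) st.st_size;
    return p;
}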
Craig Ringer wrote:
> This is probably a stupid question, but: with PostgreSQL's use of
> shared memory, is it possible to load dictionaries into a small
> reserved shm area when the first backend starts, then use the
> preloaded copy in subsequent backends?
>
> That way the postmaster doesn't have to do any risky work.
>
> Anything that reduces backend startup costs and per-backend unshared
> memory would have to be a good thing.
>
> I've found it useful in the past to share resources with an mmap()ed
> file, too, especially if I want write protection from some or all
> processes. If the postmaster forked a process to generate the
> mmap()able compiled dictionary files on startup, then it'd be pretty
> safe from any misbehaviour of the dictionary-compiling process.
>
> Then again, I can't say I've personally noticed the cost of loading
> tsearch2 dictionaries.

So the dictionary would be parsed on first use by a given backend, and from that moment on, all running backends - and all backends spawned afterwards - would have access to the parsed dictionary structures through the shm? That seems to solve all the issues: speed, memory and updating.

Would this be a way to go? Obviously, it might boil down to "write a patch", but if someone actually wrote one, would this approach be acceptable?

Thanks,
Steve

PS: Please CC me, as I am off the list.
Hello,

We definitely came across this issue recently. When a new Postgres backend is started, it uses ~3 MB of memory according to pmap. When one runs several typical queries that our application generates within this backend, its consumed memory grows to 5-8 MB, which is not critical for us. But when one hits some FTS function with a token that requires the ispell dictionaries to be loaded, we instantly get 26 MB of consumed memory in this backend.

Having 50 backends behind pgbouncer, all of them holding ~20 MB of redundant FTS data, is a serious penalty on some hardware: during peak load the kernel invalidates huge parts of its disk cache - containing the actually 'hot' data - to allocate more RAM for the Postgres backends, and we get huge iowait as a result. We definitely observe this scenario on one of our servers now, and the ability to save that much RAM by putting some FTS data in shared memory would help here. We alter dictionaries once per couple of months and would tolerate even a Postgres restart after such a change.

--
Regards,
Ivan Zolotukhin

>> This is probably a stupid question, but: with PostgreSQL's use of
>> shared memory, is it possible to load dictionaries into a small
>> reserved shm area when the first backend starts, then use the
>> preloaded copy in subsequent backends?
>>
>> That way the postmaster doesn't have to do any risky work.
>>
>> Anything that reduces backend startup costs and per-backend unshared
>> memory would have to be a good thing.
>>
>> I've found it useful in the past to share resources with an mmap()ed
>> file, too, especially if I want write protection from some or all
>> processes. If the postmaster forked a process to generate the
>> mmap()able compiled dictionary files on startup, then it'd be pretty
>> safe from any misbehaviour of the dictionary-compiling process.
>>
>> Then again, I can't say I've personally noticed the cost of loading
>> tsearch2 dictionaries.
>
> So the dictionary would be parsed on first use by a given backend, and
> from that moment on, all running backends - and all backends spawned
> afterwards - would have access to the parsed dictionary structures
> through the shm?
>
> That seems to solve all the issues: speed, memory and updating. Would
> this be a way to go? Obviously, it might boil down to "write a patch",
> but if someone actually wrote one, would this approach be acceptable?