experimental: TSearch dictionary [de]serialization - Mailing list pgsql-hackers
From | Pavel Stehule |
---|---|
Subject | experimental: TSearch dictionary [de]serialization |
Date | |
Msg-id | AANLkTinnim1joUog5bWsFW06uC4vVESZg6XoH40sbTSw@mail.gmail.com |
List | pgsql-hackers |
Hello,

I wrote some very primitive code for testing serialization and deserialization of a TSearch ISpell dictionary. The code works, but for now it is useful only for speed tests. The Czech fulltext dictionary serializes to a file of about 9 MB. Saving takes about 90 ms, and reading takes about the same time.

postgres=# select * from ts_debug('cs','příliš žluťoučký kůň se napil žluté vody');
   alias   │    description    │   token   │  dictionaries   │ dictionary │   lexemes
───────────┼───────────────────┼───────────┼─────────────────┼────────────┼─────────────
 word      │ Word, all letters │ příliš    │ {cspell,simple} │ cspell     │ {příliš}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 word      │ Word, all letters │ žluťoučký │ {cspell,simple} │ cspell     │ {žluťoučký}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 word      │ Word, all letters │ kůň       │ {cspell,simple} │ cspell     │ {kůň}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 asciiword │ Word, all ASCII   │ se        │ {cspell,simple} │ cspell     │ {}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 asciiword │ Word, all ASCII   │ napil     │ {cspell,simple} │ cspell     │ {napít}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 word      │ Word, all letters │ žluté     │ {cspell,simple} │ cspell     │ {žlutý}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 asciiword │ Word, all ASCII   │ vody      │ {cspell,simple} │ cspell     │ {voda}
(13 rows)

Time: 92.708 ms -- using a preprocessed dictionary

postgres=# select * from ts_debug('cs','příliš žluťoučký kůň se napil žluté vody');
   alias   │    description    │   token   │  dictionaries   │ dictionary │   lexemes
───────────┼───────────────────┼───────────┼─────────────────┼────────────┼─────────────
 word      │ Word, all letters │ příliš    │ {cspell,simple} │ cspell     │ {příliš}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 word      │ Word, all letters │ žluťoučký │ {cspell,simple} │ cspell     │ {žluťoučký}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 word      │ Word, all letters │ kůň       │ {cspell,simple} │ cspell     │ {kůň}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 asciiword │ Word, all ASCII   │ se        │ {cspell,simple} │ cspell     │ {}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 asciiword │ Word, all ASCII   │ napil     │ {cspell,simple} │ cspell     │ {napít}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 word      │ Word, all letters │ žluté     │ {cspell,simple} │ cspell     │ {žlutý}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 asciiword │ Word, all ASCII   │ vody      │ {cspell,simple} │ cspell     │ {voda}
(13 rows)

Time: 3.758 ms -- standard time (dictionary is loaded)

postgres=# select * from ts_debug('cs','příliš žluťoučký kůň se napil žluté vody');
   alias   │    description    │   token   │  dictionaries   │ dictionary │   lexemes
───────────┼───────────────────┼───────────┼─────────────────┼────────────┼─────────────
 word      │ Word, all letters │ příliš    │ {cspell,simple} │ cspell     │ {příliš}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 word      │ Word, all letters │ žluťoučký │ {cspell,simple} │ cspell     │ {žluťoučký}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 word      │ Word, all letters │ kůň       │ {cspell,simple} │ cspell     │ {kůň}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 asciiword │ Word, all ASCII   │ se        │ {cspell,simple} │ cspell     │ {}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 asciiword │ Word, all ASCII   │ napil     │ {cspell,simple} │ cspell     │ {napít}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 word      │ Word, all letters │ žluté     │ {cspell,simple} │ cspell     │ {žlutý}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 asciiword │ Word, all ASCII   │ vody      │ {cspell,simple} │ cspell     │ {voda}
(13 rows)

Time: 518.528 ms -- typical first evaluation time

So using a preprocessed file helps: the first evaluation is roughly 5x faster (518.5 ms vs. 92.7 ms). But it is still roughly 25x slower than with an already loaded dictionary (3.8 ms).

I found one issue: I am not able to serialize a full regexp. The Czech dictionary doesn't use one, so I didn't solve this.

I would like to implement a few hooks in ISpellDictionary so that it is possible to implement custom memory management for ispell dictionaries. I understand the problems with shared memory or mmap, but I don't see any other way than to use third-party mmap support. This module does not have to be in core; probably this is only a local Czech (and maybe Japanese) problem.

Regards
Pavel Stehule
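To illustrate the general idea of a preprocessed dictionary file that can be loaded cheaply, here is a minimal sketch in C. It is not the attached patch and does not use any PostgreSQL API: the names FlatDictHeader, FlatDict, flat_dict_save, flat_dict_open, flat_dict_word and flat_dict_close, as well as the file layout, are hypothetical. The sketch assumes a POSIX system with mmap and stores only a word list in a pointer-free, offset-based layout, so the file can be written sequentially and then mapped read-only without rebuilding anything in memory.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define FLAT_DICT_MAGIC 0x44494354u     /* arbitrary tag, hypothetical */

/* Hypothetical file layout: this header, then nwords uint32_t offsets
 * (measured from the start of the file), then NUL-terminated strings. */
typedef struct
{
    uint32_t    magic;
    uint32_t    nwords;
} FlatDictHeader;

/* A read-only mapping of a file written by flat_dict_save(). */
typedef struct
{
    const void *base;
    size_t      size;
} FlatDict;

/* Serialize a word list. Returns 0 on success, -1 on error. */
static int
flat_dict_save(const char *path, const char *const *words, uint32_t nwords)
{
    FlatDictHeader hdr = {FLAT_DICT_MAGIC, nwords};
    uint32_t    off = (uint32_t) (sizeof(hdr) + nwords * sizeof(uint32_t));
    FILE       *f = fopen(path, "wb");
    int         ok;

    if (f == NULL)
        return -1;
    ok = fwrite(&hdr, sizeof(hdr), 1, f) == 1;
    /* Offsets table: where each string starts inside the file. */
    for (uint32_t i = 0; ok && i < nwords; i++)
    {
        ok = fwrite(&off, sizeof(off), 1, f) == 1;
        off += (uint32_t) strlen(words[i]) + 1;
    }
    /* String data, each entry including its terminating NUL. */
    for (uint32_t i = 0; ok && i < nwords; i++)
    {
        size_t      len = strlen(words[i]) + 1;

        ok = fwrite(words[i], 1, len, f) == len;
    }
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Map the file read-only. Returns 0 on success, -1 on error. */
static int
flat_dict_open(const char *path, FlatDict *d)
{
    struct stat st;
    void       *p;
    int         fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) != 0 || (size_t) st.st_size < sizeof(FlatDictHeader))
    {
        close(fd);
        return -1;
    }
    p = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return -1;
    if (((const FlatDictHeader *) p)->magic != FLAT_DICT_MAGIC)
    {
        munmap(p, (size_t) st.st_size);
        return -1;
    }
    d->base = p;
    d->size = (size_t) st.st_size;
    return 0;
}

/* Return word i directly from the mapping, or NULL if out of range. */
static const char *
flat_dict_word(const FlatDict *d, uint32_t i)
{
    const FlatDictHeader *hdr;
    const uint32_t *offsets;

    assert(d != NULL && d->base != NULL);
    hdr = (const FlatDictHeader *) d->base;
    if (i >= hdr->nwords)
        return NULL;
    offsets = (const uint32_t *) (hdr + 1);
    return (const char *) d->base + offsets[i];
}

/* Unmap the file and reset the handle. */
static void
flat_dict_close(FlatDict *d)
{
    munmap((void *) d->base, d->size);
    d->base = NULL;
    d->size = 0;
}
```

A caller would run flat_dict_save() once, then use flat_dict_open() and flat_dict_word() in each process that needs the data. Because the file contains offsets rather than pointers, it can be mapped at any address. A real ISpell dictionary holds much richer structures (prefix trees, affix rules), so this only shows the layout principle, not a working replacement for the dictionary loader.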
Attachment