Re: website doc search is extremely SLOW - Mailing list pgsql-general
From | Oleg Bartunov |
---|---|
Subject | Re: website doc search is extremely SLOW |
Date | |
Msg-id | Pine.GSO.4.58.0401031707160.11643@ra.sai.msu.su |
In response to | Re: website doc search is extremely SLOW ("Marc G. Fournier" <scrappy@postgresql.org>) |
Responses |
Re: website doc search is extremely SLOW
Re: website doc search is extremely SLOW |
List | pgsql-general |
Hi there,

I hoped to release a pilot version of www.pgsql.ru with full text search
of postgresql related resources (currently we've crawled 27 sites, about
340K pages), but we started celebrating New Year too early :)
Expect it tomorrow or Monday.

We have developed many search engines. Some of them, like tsearch2 and
OpenFTS, are based on PostgreSQL and are best embedded into a CMS for
true online updating. Their power comes from access to document
attributes stored in the database, so one can perform categorized search
or restricted search (different rights, different document status, etc.).
The closest example would be search over an archive of mailing lists,
which should embed this kind of full text search engine.
fts.postgresql.org in its best days was one implementation of such a
system. This is what I hope to have on www.pgsql.ru, if Marc will give us
access to the mailing list archives :)

The other search engines we use are based on the standard technology of
inverted indices; they are best suited for indexing semi-static
collections of documents. We have a full-fledged crawler, indexer and
searcher. Online update of inverted indices is a rather complex
technological task, and I'm not sure there are databases which have true
online update. On www.pgsql.ru we use GTSearch, a generic text search
engine we developed for vertical searches (for example, postgresql
related resources). It has a common set of features such as phrase search,
proximity ranking, site search, morphology, stemming support, cached
documents, spell checking, similar search, etc.

I see several separate tasks:

* official documents (documentation mostly)

  I'm not sure whether there is some kind of CMS on www.postgresql.org,
  but if there is, the best way is to embed tsearch2 into the CMS.
  You'll have a fast, incremental search engine. There are many users of
  tsearch2 and I think embedding isn't a very difficult problem. I
  estimate there are at most 10-20K pages of documentation, which is
  nothing for tsearch2.

* mailing lists archive

  The mailing list archive is constantly growing and also requires
  incremental update, so tsearch2 is needed here as well. Nice hardware
  like Marc has described would be more than enough. We have a moderate
  dual PIII 1GHz server and I hope it would be enough.

* postgresql related resources

  I think this task should be solved using the standard technique -
  crawler, indexer, searcher. Because the number of sites is limited,
  it's possible to keep the indices more up to date than the major search
  engines do, for example by crawling once a week. This is what we
  currently have on pgsql.ru, because it doesn't require any permissions
  or interaction with site officials.

	Regards,
		Oleg

On Wed, 31 Dec 2003, Marc G. Fournier wrote:

> On Tue, 30 Dec 2003, Joshua D. Drake wrote:
>
> > Hello,
> >
> > Why are we not using Tsearch2?
>
> Because nobody has built it yet? Oleg's stuff is nice, but we want
> something that we can build into the existing web sites, not a standalone
> site ...
>
> I keep searching the web hoping someone has come up with a 'tsearch2'
> based search engine that does the spidering, but, unless its sitting right
> in front of my eyes and I'm not seeing it, I haven't found it yet :(
>
> Out of everything I've found so far, mnogosearch is one of the best ... I
> just wish I could figure out where the bottleneck for it was, since, from
> reading their docs, their method of storing the data doesn't appear to be
> particularly off. I'm tempted to try their caching storage manager, and
> getting away from SQL totally, but I *really* want to showcase PostgreSQL
> on this :(
>
> ----
> Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
> Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
> (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
>

	Regards,
		Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
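
For readers unfamiliar with the "inverted indices" approach mentioned in the message, here is a minimal sketch in Python. It is only an illustration of the general technique, not how GTSearch, tsearch2 or OpenFTS are implemented; the function names (tokenize, build_index, search_all, search_phrase) and the sample documents are all hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build a positional inverted index: term -> {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def search_all(index, query):
    """Return ids of documents containing every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = None
    for term in terms:
        ids = set(index.get(term, {}))
        result = ids if result is None else result & ids
    return result


def search_phrase(index, phrase):
    """Return ids of documents where the terms occur consecutively."""
    terms = tokenize(phrase)
    hits = set()
    for doc_id in search_all(index, phrase):
        # Try every position of the first term as a phrase start.
        for start in index[terms[0]][doc_id]:
            if all(start + i in index[t][doc_id]
                   for i, t in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits


# Hypothetical sample collection, for illustration only.
DOCS = {
    1: "PostgreSQL full text search with tsearch2",
    2: "Full-fledged crawler, indexer and searcher",
    3: "Text search: full crawl once a week",
}
INDEX = build_index(DOCS)
```

With this sample data, search_all(INDEX, "full text") matches documents 1 and 3, while search_phrase(INDEX, "full text") matches only document 1. Note that adding or changing one document touches the posting list of every term it contains, which gives some feel for why online update of a large on-disk index is harder than a periodic rebuild.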