Re: [GENERAL] Re: full text searching - Mailing list pgsql-hackers
From | Oleg Bartunov |
---|---|
Subject | Re: [GENERAL] Re: full text searching |
Date | |
Msg-id | Pine.GSO.4.33.0102082306320.22966-100000@ra.sai.msu.su |
In response to | Re: [GENERAL] Re: full text searching (Ned Lilly <ned@greatbridge.com>) |
Responses | Re: [GENERAL] Re: full text searching |
List | pgsql-hackers |
On Thu, 8 Feb 2001, Ned Lilly wrote:

> (bcc'ed to -hackers)
>
> Gunnar Rønning wrote:
>
> > Does anybody know how Oracle has implemented their "context" search or
> > whatever it is called nowadays ?
>
> They're calling it Intermedia now ... http://www.oracle.com/intermedia/
>
> I have yet to meet an Oracle customer who likes it.
>
> I think there's a lot of agreement that this is an area where Postgres
> could use some work. I know Oleg Bartunov has done some interesting
> work with Postgres and the search engine at the Russian portal site
> "Rambler" ... http://www.rambler.ru/ . Oleg, could you talk a bit about
> what you guys did?

Well, we have an FTS engine fully based on postgresql. It was developed
specifically for indexing dynamic text collections like online news.
It supports morphology and uses coordinate information and sophisticated
ranking of search results. Search and ranking are built into postgres.

Currently the biggest collection we have is about 300,000 messages.
We're not very happy with performance on a collection of that size, and
to improve it we did some research in the GiST area. Using GiST we built
index support for integer arrays, which greatly improves search
performance!

Right now we are trying to understand how to improve sort performance,
which is the final (we hope) stopper for our FTS. Let me explain a bit:
search performance is great, but in a real-life application we have to
display the search results on a Web page, page by page. Results could be
sorted by relevance or another parameter. In the case of online news or
a mailing list archive, results are sorted by publication date. We found
that most of the time is spent sorting the full set of results, while we
need just 10-15 rows to display on a Web page (using ORDER BY .. LIMIT,
OFFSET). Some queries in our case produce about 50,000 rows (search for
"Putin", for example)! Sort time is enormous and eats all the
performance gain we made for search.

One solution we are currently investigating is implementing partial sort
in postgres. We don't need to sort the full set. Currently LIMIT provides
a rather simple optimization: only part of the results is transferred
from backend to client. We propose to stop sorting as soon as that part
of the results is already sorted. From our experience and the literature
we know that 95% of all hits go to the first 2 pages of search results.
In our worst case with 50,000 rows we could get the first page to display
about 5-6 times faster if we do partial sorting.

I understand it looks like a rather limited area for optimization, but
many people would appreciate such an optimization. I remember that when
I asked Jan to implement the LIMIT feature, many friends immediately
moved from mysql to postgres. This feature isn't standard, but it's Web
friendly and most web applications use it.

We have a patch for 7.1, well, just a sketch we did for benchmarking
purposes. Tom isn't happy and we still need some help from core
developers. But it is time for the 7.1 release and we don't want to
bother developers right now. Anyway, for a medium-size collection our
FTS is good enough even using plain 7.0.3. We were planning to release
FTS as open source before the new year but got held up by an
organizational problem (still have :-(

>
> If there's interest in spinning up a separate project to sit outside the
> database, a la Intermedia or Verity, we'd be happy to sponsor such a
> thing on our GreatBridge.org project hosting site (CVS, bug tracking,
> mail lists, etc.)

We plan to develop a sample application, searching the postgres mail
archives (I have a collection from 1995), and present it for testing.
If people are happy with the performance and quality of results we
could install it on www.postgresql.org.

>
> Regards,
> Ned
>

	Regards,
		Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
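
The partial sort idea discussed in the message above (sort only as many rows as ORDER BY .. LIMIT, OFFSET will actually return) can be illustrated outside the database. The Python sketch below is not the 7.1 patch mentioned in the message, whose details are not given; it swaps in a bounded heap via the standard `heapq.nlargest` as one possible way to get the same effect. The function names, the `(message_id, publication_timestamp)` row layout, and the generated data are all hypothetical and chosen only for illustration.

```python
import heapq
import random
from operator import itemgetter


def full_sort_page(rows, key, limit, offset=0):
    """Baseline: sort every row, then slice out one page.

    Mirrors ORDER BY <key> DESC LIMIT <limit> OFFSET <offset> when the
    whole result set is sorted first.
    """
    return sorted(rows, key=key, reverse=True)[offset:offset + limit]


def partial_sort_page(rows, key, limit, offset=0):
    """Partial sort: only the top (offset + limit) rows are ever ordered.

    heapq.nlargest keeps a bounded heap of that size while scanning the
    rows once, so the remaining rows are never fully sorted.
    """
    top = heapq.nlargest(offset + limit, rows, key=key)
    return top[offset:offset + limit]


# Hypothetical data: 50,000 hits as (message_id, publication_timestamp).
# Timestamps are unique here, so the ordering is unambiguous.
rng = random.Random(0)
hits = list(enumerate(rng.sample(range(10**9), 50_000)))
by_date = itemgetter(1)

# First page of 15 results, newest first.
first_page = partial_sort_page(hits, by_date, limit=15)
```

Both functions return the same page; the difference is that the heap-based version does work proportional to the number of rows times the logarithm of the page depth, rather than sorting all 50,000 rows. The benefit shrinks as OFFSET grows, which fits the observation in the message that nearly all requests are for the first two pages.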