Enhancing phonetic search support for more languages - GSoC 2010 - Mailing list pgsql-hackers
From | Dhiraj Lohiya |
---|---|
Subject | Enhancing phonetic search support for more languages - GSoC 2010 |
Date | |
Msg-id | h2rb268c9e91004071324r2ea2471p3135f5d4b485ad30@mail.gmail.com |
List | pgsql-hackers |
Hello
I am Dhiraj Lohiya, a Computer Science undergraduate from BITS Pilani. I would like to propose an idea to improve phonetic search support, initially for some Indian languages such as Hindi and Marathi, with a framework that lets others extend it to further languages by contributing rules in a simple format. I am looking to take this forward as a GSoC project. Please have a look and see whether you find it interesting:
I plan to customize the soundex algorithm so that each language can have its own set of phonetic equivalence classes (generally around 20 rules for most of the Indian languages I have worked with). I would keep the approach layered, so that rule sets for multiple languages can be contributed easily and more languages can be added by others.
Moreover, it is important that once someone has defined a base set of rules, the rules can be extended and can evolve based on user input and usage.
For instance, if many users (above a threshold set by us) enter a search string for which the wanted result is not retrieved, we could track what each of them finally selects and then append to or modify our set of phonetic rules, based on the phonetic mismatch between the query entered and the result wanted. In this way the rule sets could evolve on their own as we collect usage statistics from users. This feature would add a new dimension to the search functionality and would make it stand out.
Initially I plan to code this for a few Indian languages such as Hindi and Marathi, and to define a simple way of adding rules for other languages directly, so that people who know those languages can contribute. This would probably be a GUI based on the concept of Google Image Labeler, in which two words that sound similar are mapped to each other to improve the rule set.
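To make the layered idea concrete, here is a minimal sketch in Python of a soundex-style encoder that takes its equivalence classes from a per-language rule set. The names `RuleSet`, `RULE_SETS` and `encode` are made up for illustration, and the only rule set shown is the classic English soundex grouping. A Hindi or Marathi rule set would be one more entry in the registry, and this is not the rule format from my Java implementation.

```python
from dataclasses import dataclass, field


@dataclass
class RuleSet:
    """Phonetic equivalence classes for one language (hypothetical format)."""
    language: str
    # Maps a group of letters to the single code symbol they share.
    classes: dict = field(default_factory=dict)

    def code_for(self, ch):
        for letters, code in self.classes.items():
            if ch in letters:
                return code
        return ""  # letters outside every class (e.g. vowels) get no code


# Registry of contributed rule sets. Only the classic English soundex
# classes are filled in here; other languages would be added as entries.
RULE_SETS = {
    "en": RuleSet("en", {
        "bfpv": "1", "cgjkqsxz": "2", "dt": "3",
        "l": "4", "mn": "5", "r": "6",
    }),
}


def encode(word, rules, length=4):
    """Soundex-style key: first letter, then class codes, skipping a code
    that repeats the code of the immediately preceding letter."""
    word = "".join(c for c in word.lower() if c.isalpha())
    if not word:
        return ""
    key = word[0].upper()
    prev = rules.code_for(word[0])
    for ch in word[1:]:
        code = rules.code_for(ch)
        if code and code != prev:
            key += code
        prev = code  # an uncoded letter resets the repeat check
    return key[:length].ljust(length, "0")


en = RULE_SETS["en"]
print(encode("Robert", en))                        # R163
print(encode("pyar", en) == encode("pyaar", en))   # True
```

The algorithm never looks at the language itself, only at the rule set it is handed, which is what would let contributors add a language without touching the encoder.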
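The threshold idea could be sketched roughly as follows. This is only an illustration in Python: `FeedbackTracker`, its methods and the user identifiers are hypothetical names, and the sketch stops at collecting candidate word pairs. Turning a pair into an actual rule change is a separate step.

```python
from collections import defaultdict


class FeedbackTracker:
    """Collects (query, finally selected result) pairs per distinct user."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        # (query, selected) -> set of users who produced that pair
        self.users_by_pair = defaultdict(set)

    def record(self, user_id, query, selected):
        pair = (query.lower(), selected.lower())
        if pair[0] != pair[1]:  # an exact hit tells us nothing new
            self.users_by_pair[pair].add(user_id)

    def candidates(self):
        """Pairs reported by at least `threshold` distinct users."""
        return sorted(
            pair for pair, users in self.users_by_pair.items()
            if len(users) >= self.threshold
        )


tracker = FeedbackTracker(threshold=2)
tracker.record("user1", "nayya", "naiyya")
tracker.record("user1", "nayya", "naiyya")   # same user, counted once
tracker.record("user2", "Nayya", "naiyya")
print(tracker.candidates())                  # [('nayya', 'naiyya')]
```

Counting distinct users rather than raw events is a design choice made here so that one user repeating a search cannot push a pair over the threshold alone.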
Samples:
- Some cases from Hindi songs:
- If I search for a song that has the word "naiyya" but spell it as "nayya", presently no result is returned, since "nayya" is not in the playlist.
- Moreover, if "pyar" is searched, the results differ from those for "pyaar", but it is easy to see that both are the same word and hence should give the same results.
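The two samples above can be reproduced with a small lookup sketch in Python. Everything here is illustrative: `crude_key` is a deliberately crude stand-in for a real rule-based encoder, and the playlist titles are invented for the example.

```python
from collections import defaultdict

VOWELS = set("aeiou")


def crude_key(word):
    """Stand-in key: keep the first letter, drop later vowels, and skip a
    letter equal to the last letter already kept."""
    word = word.lower()
    out = word[:1]
    for ch in word[1:]:
        if ch in VOWELS or ch == out[-1]:
            continue
        out += ch
    return out


def build_index(titles):
    """Map each word's phonetic key to the titles that contain the word."""
    index = defaultdict(set)
    for title in titles:
        for word in title.split():
            index[crude_key(word)].add(title)
    return index


def search(index, query):
    return sorted(index.get(crude_key(query), set()))


# Made-up playlist entries, for illustration only.
playlist = ["naiyya paar", "pyaar ka geet"]
index = build_index(playlist)
print(search(index, "nayya"))   # ['naiyya paar']
print(search(index, "pyar"))    # ['pyaar ka geet']
```

Indexing by key rather than by spelling is what makes both spellings land on the same entry. A real rule set would replace `crude_key` and would be far more selective about which words it merges.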
Some background on this:
I have already worked out a basic customized version of the soundex algorithm as part of my intern project at PennyWiseSolutions and implemented it in Java. It can also improve its own rule set when given two phonetically similar words as input. Right now, the rule sets are designed only for Hindi and Marathi. The results are narrowed down well, with far fewer false positives, for both languages. Since the algorithm itself stays the same (almost equivalent to soundex) and only the rule sets for other languages need to be contributed for the algorithm to process, I think this approach could work. One specific customization was to drop soundex's special handling of silent letters, since users do not really use silent letters when spelling a Hindi word in English.
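As a rough illustration of how two phonetically similar words could point at a missing rule, the following Python sketch aligns the two spellings and reports only the parts that differ. It uses the standard library's `difflib.SequenceMatcher`. The function name `differing_segments` is hypothetical, and this is not how the Java implementation works, only one possible way to find the mismatch.

```python
from difflib import SequenceMatcher


def differing_segments(a, b):
    """Return (part_of_a, part_of_b) for every stretch where the two
    spellings disagree; an empty string means a pure insertion or deletion."""
    a, b = a.lower(), b.lower()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


print(differing_segments("pyar", "pyaar"))     # [('', 'a')]
print(differing_segments("nayya", "naiyya"))   # [('', 'i')]
```

A segment such as `('', 'a')` suggests a candidate rule (here, that a doubled vowel is equivalent to a single one). Someone who knows the language would still need to review it before it goes into the rule set.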
I would be glad to have more input on this.
--
Regards
Dhiraj Lohiya