Thread: Multiple word synonyms (maybe?)
Hi All

I have a question regarding PostgreSQL's full text capabilities and (presumably) the synonym dictionary.

I'm currently implementing FTS on a medical themed setup which uses domain specific jargon to denote a bunch of stuff. A specific request I wish to implement here are the jargon synonyms that are heavily used.

Of course, I can simply go ahead and create my own synonym dictionary with a jargon specific synonym file to feed it. However, most of the synonyms consist of more than a single word.

The term "heart attack" for example has the following "synonyms":

- Acute MI
- MI
- Myocardial infarction

As far as I understand it, the tokenizer within the PostgreSQL FTS engine splits words on spaces to generate tokens which are then proposed to each dictionary. I think it is therefore impossible to have "multi-word synonyms" in this sense, as multiple words cannot reach the dictionary. The term "heart attack" would be presented as the tokens "heart" and "attack".

From a technical standpoint I understand FTS is about looking at individual words and lexemizing them ... yet from a natural language lookup perspective you still wish to tie "Heart attack" to "Acute MI" so when a client searches on one, the other will turn up as well.

Should I write my own tokenizer to catch all these words and present them as a single token? Or is this completely outside the realm of FTS (or FTS within PostgreSQL)?

Cheers,
Tim
On Tue, 2015-10-20 at 19:35 +0900, Tim van der Linden wrote:
> Hi All
>
> I have a question regarding PostgreSQL's full text capabilities and
> (presumably) the synonym dictionary.
>
> I'm currently implementing FTS on a medical themed setup which uses
> domain specific jargon to denote a bunch of stuff. A specific request
> I wish to implement here are the jargon synonyms that are heavily
> used.
>
> Of course, I can simply go ahead and create my own synonym dictionary
> with a jargon specific synonym file to feed it. However, most of the
> synonyms consist of more than a single word.
>
> The term "heart attack" for example has the following "synonyms":
>
> - Acute MI
> - MI
> - Myocardial infarction
>
> As far as I understand it, the tokenizer within PostgreSQL FTS engine
> splits words on spaces to generate tokens which are then proposed to
> each dictionary. I think it is therefore impossible to have
> "multi-word synonyms" in this sense as multiple words cannot reach the
> dictionary. The term "heart attack" would be presented as the tokens
> "heart" and "attack".
>
> From a technical standpoint I understand FTS is about looking at
> individual words and lexemizing them ... yet from a natural language
> lookup perspective you still wish to tie "Heart attack" to "Acute MI"
> so when a client searches on one, the other will turn up as well.
>
> Should I write my own tokenizer to catch all these words and present
> them as a single token? Or is this completely outside the realm of
> FTS (or FTS within PostgreSQL)?
>
> Cheers,
> Tim

Looking at this from an entirely different perspective, why are you not using ICD codes to identify patient events?
It is a one-to-many relationship between a patient and their events, each identified by the relevant ICD code and date.
Given that MI has several applicable ICD codes, you can use a select along the lines of:

WHERE icd_code IN ( . . . )

I know it doesn't answer your question!

Cheers,
Rob
> Of course, I can simply go ahead and create my own synonym dictionary with a jargon specific synonym file to feed it. However, most of the synonyms consist of more than a single word.
Does the Thesaurus dictionary not do what you want?
Geoff
On Tuesday, October 20, 2015 6:05 AM, Geoff Winkless <pgsqladmin@geoff.dj> wrote:
> On 20 October 2015 at 11:35, Tim van der Linden <tim@shisaa.jp> wrote:
>> Of course, I can simply go ahead and create my own synonym
>> dictionary with a jargon specific synonym file to feed it. However,
>> most of the synonyms consist of more than a single word.
>
> Does the Thesaurus dictionary not do what you want?
>
> http://www.postgresql.org/docs/current/static/textsearch-dictionaries.html#TEXTSEARCH-THESAURUS

+1

I had a very similar need for legal terms (e.g., "power of attorney") and the thesaurus fit that need exactly.

I don't know whether you'll run into the other need I had that required some special handling for full text search with legal documents: things like dates, case numbers, and statute cites were not handled well by default. What I did there was to pick those out with regular expression searches, put them into a space-separated string, cast that to tsvector, assign a higher weight to such key elements, and concatenate that tsvector with the one generated from the standard text parser and dictionaries.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
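For readers who want to see the first half of Kevin's approach in concrete form, here is a minimal Python sketch of the "pick out key elements with regular expressions and join them into a space-separated string" step. Everything named here is an assumption for illustration: `KEY_PATTERNS`, `extract_key_terms`, the sample sentence, and in particular the case-number shape are made up and are not taken from Kevin's setup.

```python
import re

# Hypothetical patterns: real dates, case numbers, and statute cites
# will look different depending on the documents at hand.
KEY_PATTERNS = [
    r"\b\d{4}-\d{2}-\d{2}\b",       # ISO-style date, e.g. 2015-10-20
    r"\b\d{2}-[A-Z]{2}-\d{4,6}\b",  # invented case-number shape, e.g. 15-CV-001234
]


def extract_key_terms(text):
    """Return every pattern match as one space-separated string."""
    found = []
    for pattern in KEY_PATTERNS:
        found.extend(re.findall(pattern, text))
    return " ".join(found)


sample = "Hearing on 2015-10-20 in case 15-CV-001234 regarding power of attorney."
key_terms = extract_key_terms(sample)
print(key_terms)  # 2015-10-20 15-CV-001234
```

On the PostgreSQL side, a string like this is what would then be cast to tsvector, given a higher weight (the built-in `setweight` function does that), and joined to the output of the standard parser with the `||` operator, as described in the message above.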
On Tue, 20 Oct 2015 21:57:59 +1100 rob stone <floriparob@gmail.com> wrote:
>
> Looking at this from an entirely different perspective, why are you not
> using ICD codes to identify patient events?
> It is a one to many relationship between patient and their events
> identified by the relevant ICD code and date.
> Given that MI has several applicable ICD codes you can use a select
> along the lines of:-
> WHERE icd_code IN ( . . . )
>
>
> I know it doesn't answer your question!

It does indeed not answer my direct question, but it does offer an interesting perspective that could be used in one of the next phases of the medical application.

Thanks for the heads-up!

> Cheers,
> Rob

Cheers,
Tim
On Tue, 20 Oct 2015 12:02:46 +0100
> Does the Thesaurus dictionary not do what you want?
>
> http://www.postgresql.org/docs/current/static/textsearch-dictionaries.html#TEXTSEARCH-THESAURUS

Damn, I completely overlooked that one, and it indeed does seem to come very close to what I need in this use case. Thanks for jolting my memory (also @Kevin) :)

If I am not mistaken, this would be a valid thesaurus file:

acute mi : heart attack
mi : heart attack
myocardial infarction : heart attack

Multiple words on both ends, separated by a colon, and each line being functional (a unique phrase linked to its more generic replacement)?

> Geoff

Cheers,
Tim
On Tuesday, October 20, 2015 7:56 PM, Tim van der Linden <tim@shisaa.jp> wrote:
> On Tue, 20 Oct 2015 12:02:46 +0100
>> Does the Thesaurus dictionary not do what you want?
>>
>> http://www.postgresql.org/docs/current/static/textsearch-dictionaries.html#TEXTSEARCH-THESAURUS
>
> Damn, I completely overlooked that one, and it indeed does seem
> to come very close to what I need in this use case.

I have to admit that the name of that dictionary type threw me off a bit at first.

> If I am not mistaken, this would be a valid thesaurus file:
>
> acute mi : heart attack
> mi : heart attack
> myocardial infarction : heart attack
>
> Multiple words on both ends, separated by a colon and each line
> being functional (a unique phrase linked to its more generic
> replacement)?

It has been a while, but my recollection is that I did something more like this:

heart attack : heartattack
acute mi : heartattack
mi : heartattack
myocardial infarction : heartattack

If my memory is to be trusted, both the original words (whichever are actually in the document) and the "invented" synonym ("heartattack") will be in the tsvector/tsquery; this results in all *matching* but the identical wording being considered a *closer match*.

As with most things, I encourage you to play around with it a bit to see what gives the best results for you.

--
Kevin Grittner
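To make the "phrase : replacement" file format discussed above concrete, here is a small Python sketch that parses such lines and applies them to a token stream, preferring the longest phrase at each position. It is only a simulation of the idea under stated assumptions, not PostgreSQL's thesaurus implementation: `parse_thesaurus` and `apply_thesaurus` are hypothetical names, tokens are simply split on whitespace and lowercased, and the subdictionary normalization and stop-word handling that a real thesaurus dictionary involves are left out.

```python
def parse_thesaurus(text):
    """Parse 'sample phrase : replacement' lines into {phrase words: replacement words}."""
    rules = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        phrase, _, replacement = line.partition(":")
        rules[tuple(phrase.lower().split())] = replacement.lower().split()
    return rules


def apply_thesaurus(tokens, rules):
    """Replace the longest matching phrase at each position; pass other tokens through."""
    longest = max((len(phrase) for phrase in rules), default=0)
    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(longest, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in rules:
                out.extend(rules[key])
                i += n
                break
        else:
            out.append(tokens[i].lower())
            i += 1
    return out


THESAURUS = """
heart attack : heartattack
acute mi : heartattack
mi : heartattack
myocardial infarction : heartattack
"""

rules = parse_thesaurus(THESAURUS)
query = apply_thesaurus("acute MI".split(), rules)
document = apply_thesaurus("Patient admitted after a heart attack".split(), rules)
print(query)     # ['heartattack']
print(document)  # ['patient', 'admitted', 'after', 'a', 'heartattack']
```

Because both the query and the document collapse to the same invented token, a search for "acute MI" would find a document that only says "heart attack". Note that this sketch emits only the right-hand side of each rule; whether the original words are also kept is exactly the kind of thing worth checking by experiment, as suggested above.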
On Wed, 21 Oct 2015 13:40:38 +0000 (UTC) Kevin Grittner <kgrittn@ymail.com> wrote:
> > Damn, I completely overlooked that one, and it indeed does seem
> > to come very close to what I need in this use case.
>
> I have to admit that the name of that dictionary type threw me off
> a bit at first.

Indeed :)

> ...
>
> It has been a while, but my recollection is that I did something
> more like this:
>
> heart attack : heartattack
> acute mi : heartattack
> mi : heartattack
> myocardial infarction : heartattack
>
> If my memory is to be trusted, both the original words (whichever
> are actually in the document) and the "invented" synonym
> ("heartattack") will be in the tsvector/tsquery; this results in
> all *matching* but the identical wording being considered a *closer
> match*.

Hmm, a very helpful insight, and it indeed makes sense to convert each phrase into a "single word" mash-up so it can be lexemized.

> As with most things, I encourage you to play around with it a bit
> to see what gives the best results for you.

Yes indeed and will do!

Thank you very much for your help. If I get this up and running, it might offer a nice opportunity to write a small post about this to expand on my PostgreSQL series...

Cheers,
Tim