Thread: tsearch2 dictionary for statute cites
I broached this topic last year[1], but the project got tabled until now, so I raise it again.  We want to be able to search text (extracted from character-based PDF files) which will contain legal terms and statute cites, and we want to be able to do tsearch2 searches (under a recent 8.3 release).  It's clear enough how to create a dictionary to gracefully handle the legal terms, but I'm less sure about the statute cites.

I got one response[2], which mentioned a prefix search in the 8.4 release, and provided a link to a Perl regular expression based dictionary.  I'm wondering if anyone has feedback on either of these techniques, and whether they might work for our needs.

I'm not sure I adequately described our needs, so I'll fill that out a little more.  People are likely to search for statute cites, which tend to have a hierarchical form.  I'm not sure the prefix approach will work for this.  For example, there is a section 939.64 in the state statutes dealing with commission of a crime while wearing a bulletproof garment.  If someone searches for that, they should find subsections like 939.64(1) or 939.64(2), but not different sections which start with the same characters, like 939.641 (the section on concealing identity) or 939.645 (the section on hate crimes).  A search for chapter 939 should return any of the above.

Of course, we want someone to be able to search on 939.64, 939.641, and 939.645 and get documents which reference all of the above (i.e., to look for a document referring to a hate crime committed while concealing identity and wearing a bulletproof garment).

Suggestions welcome on how to handle this user requirement.

-Kevin

[1] http://archives.postgresql.org/pgsql-admin/2008-06/msg00033.php
[2] http://archives.postgresql.org/pgsql-admin/2008-06/msg00034.php
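The pitfall Kevin describes with plain prefix matching can be seen with hand-built values and 8.4's tsquery prefix syntax (a sketch, not from the original post):

select '939.64.1'::tsvector @@ '939.64:*'::tsquery;  -- t: the subsection matches
select '939.641'::tsvector  @@ '939.64:*'::tsquery;  -- t: but so does the concealing-identity section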
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > People are likely to search for statute cites, which tend to have a > hierarchical form. I'm not sure the prefix approach will work for > this. For example, there is a section 939.64 in the state statutes > dealing with commission of a crime while wearing a bulletproof > garment. If someone searches for that, they should find subsections > like 939.64(1) or 939.64(2) but not different sections which start > with the same characters like 939.641 (the section on concealing > identity) or 939.645 (the section on hate crimes). A search for > chapter 939 should return any of the above. I think what you need is a custom parser that treats these similarly to hyphenated words. If I pretend that the dot is a hyphen I get matching behavior that seems to meet all those requirements. Unfortunately we don't seem to have any really easy way to plug in a custom parser, other than copy-paste-modify the existing one which would be a PITA from a maintenance standpoint. Perhaps you could pass the texts and the queries through a regexp substitution that converts digit-dot-digit to digit-dash-digit? regards, tom lane
On Tue, 10 Mar 2009, Tom Lane wrote:

> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> People are likely to search for statute cites, which tend to have a
>> hierarchical form.  I'm not sure the prefix approach will work for
>> this.  For example, there is a section 939.64 in the state statutes
>> dealing with commission of a crime while wearing a bulletproof
>> garment.  If someone searches for that, they should find subsections
>> like 939.64(1) or 939.64(2) but not different sections which start
>> with the same characters like 939.641 (the section on concealing
>> identity) or 939.645 (the section on hate crimes).  A search for
>> chapter 939 should return any of the above.
>
> I think what you need is a custom parser that treats these similarly to
> hyphenated words.  If I pretend that the dot is a hyphen I get matching
> behavior that seems to meet all those requirements.
>
> Unfortunately we don't seem to have any really easy way to plug in a
> custom parser, other than copy-paste-modify the existing one, which
> would be a PITA from a maintenance standpoint.  Perhaps you could pass
> the texts and the queries through a regexp substitution that converts
> digit-dot-digit to digit-dash-digit?

Perhaps, but for 8.4 it's better to utilize prefix search:
to_tsquery('939.645:*') will find what Kevin needs.  The problem is
with the parser, so I'd preprocess the text before indexing to convert
all digit.digit(digit) to digit.digit.digit, which is what the parser
recognizes as a single lexeme of type 'version'.  Here is just an
illustration:

qq=# select * from ts_parse('default',translate('939.64(1)','()','. '));
 tokid |  token
-------+----------
     8 | 939.64.1
    12 |

BTW, having 'version' it's possible to use dict_regex for 8.3.

Regards,
	Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
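Putting the preprocessing and the prefix search together (a sketch using the built-in simple configuration; not verbatim from Oleg's mail):

select to_tsvector('simple', translate('939.64(1)', '()', '. '))
       @@ to_tsquery('simple', '939.64:*');   -- t

Note the prefix still matches a lexeme like '939.641' as well, so this alone doesn't give the exact-boundary behavior Kevin wants; that part is taken up later in the thread.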
>>> Oleg Bartunov <oleg@sai.msu.su> wrote:
> On Tue, 10 Mar 2009, Tom Lane wrote:
>> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>>> People are likely to search for statute cites, which tend to have a
>>> hierarchical form.  I'm not sure the prefix approach will work for
>>> this.  For example, there is a section 939.64 in the state statutes
>>> dealing with commission of a crime while wearing a bulletproof
>>> garment.  If someone searches for that, they should find subsections
>>> like 939.64(1) or 939.64(2) but not different sections which start
>>> with the same characters like 939.641 (the section on concealing
>>> identity) or 939.645 (the section on hate crimes).  A search for
>>> chapter 939 should return any of the above.
>>
>> Perhaps you could pass the texts and the queries through a regexp
>> substitution that converts digit-dot-digit to digit-dash-digit?
>
> Perhaps, but for 8.4 it's better to utilize prefix search:
> to_tsquery('939.645:*') will find what Kevin needs.  The problem is
> with the parser, so I'd preprocess the text before indexing to convert
> all digit.digit(digit) to digit.digit.digit, which is what the parser
> recognizes as a single lexeme of type 'version'.  Here is just an
> illustration:
>
> qq=# select * from ts_parse('default',translate('939.64(1)','()','. '));
>  tokid |  token
> -------+----------
>      8 | 939.64.1
>     12 |
>
> BTW, having 'version' it's possible to use dict_regex for 8.3.

Tom, Oleg:  Thanks for the suggestions.  Looks promising.

-Kevin
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> People are likely to search for statute cites, which tend to have a
>> hierarchical form.

> I think what you need is a custom parser

I've just returned to this and after review have become convinced that
this is absolutely necessary; once the default parser has done its
work, figuring out the bounds of a statute cite would be next to
impossible.  Examples of the kind of fun you can have labeling
statutes, ordinances, and rules should you ever get elected to public
office:

10-3-350.10(1)(k)
10.1(40)(d)1
10.40.040(c)(2)
100.525(2)(a)3
105-10.G(3)(a)
11.04C.3.R.(1)
8.961.41(cm)
9.125.07(4A)(3)
947.013(1m)(a)

In any of these, a search string which exactly matches something up to
(but not including) a dash, dot, or left paren should find that thing.

> Unfortunately we don't seem to have any really easy way to plug in a
> custom parser, other than copy-paste-modify the existing one which
> would be a PITA from a maintenance standpoint.

I'm afraid I'm going to have to bite the bullet and do this anyway.
Any guidance on how to go about it may save me some time.  Also, if
there is any way to do this which may be useful to others or integrate
into PostgreSQL to reduce the long-term PITA aspect, I'm all ears.

-Kevin
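For reference, a rough regexp that picks out all nine shapes above (a sketch of my own, not from the thread; a production pattern would need tightening, since this also matches bare numbers):

select regexp_matches('see ss. 947.013(1m)(a) and 11.04C.3.R.(1)',
                      E'\\d+(?:[-.]\\w+)*(?:\\.?\\(\\w+\\)\\w*)*', 'g');
-- {947.013(1m)(a)}
-- {11.04C.3.R.(1)}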
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Perhaps you could pass the texts and the queries through a regexp
> substitution that converts digit-dot-digit to digit-dash-digit?

This doesn't seem to get me anywhere.  For cite '9.125.07(4A)(3)' I got
this:

select ts_debug('9-125-07-4A-3');
                            ts_debug
----------------------------------------------------------------
 (uint,"Unsigned integer",9,{simple},simple,{9})
 (int,"Signed integer",-125,{simple},simple,{-125})
 (int,"Signed integer",-07,{simple},simple,{-07})
 (int,"Signed integer",-4,{simple},simple,{-4})
 (asciiword,"Word, all ASCII",A,{english_stem},english_stem,{})
 (int,"Signed integer",-3,{simple},simple,{-3})
(6 rows)

Would there be a reasonable generalized way to pick something like this
out of a body of text using dictionaries and treat it as a statute
cite?

-Kevin
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> regexp substitution

I found a way to at least keep the cite in one piece.  Perhaps I can do
the rest in custom dictionaries, which are more pluggable.

select ts_debug
  ('State Statute <cite value="SS9.125.07(4A)(3)"> pertaining to');
                                    ts_debug
--------------------------------------------------------------------------------
 (asciiword,"Word, all ASCII",State,{english_stem},english_stem,{state})
 (blank,"Space symbols"," ",{},,)
 (asciiword,"Word, all ASCII",Statute,{english_stem},english_stem,{statut})
 (blank,"Space symbols"," ",{},,)
 (tag,"XML tag","<cite value=""SS9.125.07(4A)(3)"">",{},,)
 (blank,"Space symbols"," ",{},,)
 (asciiword,"Word, all ASCII",pertaining,{english_stem},english_stem,{pertain})
 (blank,"Space symbols"," ",{},,)
 (asciiword,"Word, all ASCII",to,{english_stem},english_stem,{})
(9 rows)

-Kevin
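Note that the tag row above shows empty dictionaries ({},,): tag tokens aren't mapped in the stock english configuration, so the wrapped cite needs a mapping before it appears in a tsvector at all.  A sketch (the 'legal' configuration name is mine, and the simple dictionary is only a stand-in for the custom cite dictionary):

CREATE TEXT SEARCH CONFIGURATION legal ( COPY = english );
ALTER TEXT SEARCH CONFIGURATION legal
    ADD MAPPING FOR tag WITH simple;

select to_tsvector('legal',
  'State Statute <cite value="SS9.125.07(4A)(3)"> pertaining to');
-- the whole tag now comes through as a single (lowercased) lexeme,
-- ready for a custom dictionary to unwrap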
Kevin,

contrib/test_parser - an example parser code.

On Mon, 6 Apr 2009, Kevin Grittner wrote:

> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>>> People are likely to search for statute cites, which tend to have a
>>> hierarchical form.
>
>> I think what you need is a custom parser
>
> I've just returned to this and after review have become convinced that
> this is absolutely necessary; once the default parser has done its
> work, figuring out the bounds of a statute cite would be next to
> impossible.  Examples of the kind of fun you can have labeling
> statutes, ordinances, and rules should you ever get elected to public
> office:
>
> 10-3-350.10(1)(k)
> 10.1(40)(d)1
> 10.40.040(c)(2)
> 100.525(2)(a)3
> 105-10.G(3)(a)
> 11.04C.3.R.(1)
> 8.961.41(cm)
> 9.125.07(4A)(3)
> 947.013(1m)(a)
>
> In any of these, a search string which exactly matches something up to
> (but not including) a dash, dot, or left paren should find that thing.
>
>> Unfortunately we don't seem to have any really easy way to plug in a
>> custom parser, other than copy-paste-modify the existing one which
>> would be a PITA from a maintenance standpoint.
>
> I'm afraid I'm going to have to bite the bullet and do this anyway.
> Any guidance on how to go about it may save me some time.  Also, if
> there is any way to do this which may be useful to others or integrate
> into PostgreSQL to reduce the long-term PITA aspect, I'm all ears.
>
> -Kevin

Regards,
	Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
Oleg Bartunov <oleg@sai.msu.su> wrote:
> contrib/test_parser - an example parser code.

Thanks!  Sorry I missed that.

-Kevin
Oleg Bartunov <oleg@sai.msu.su> wrote:
> contrib/test_parser - an example parser code.

Using that as a template, I seem to be on track to use the regexp.c
code to pick out statute cites from the text in my start function, and
recognize when I'm positioned on one in my getlexeme (GETTOKEN)
function, delegating everything before, between, and after statute
cites to the default parser.  (I really didn't want to copy/paste and
modify the whole default parser.)

That leaves one question I'm still pretty fuzzy on -- how do I go about
having a statute cite in a tsquery match the entire statute cite from a
tsvector, or delimited leading portions of it, without having it match
shorter portions?  For example:

If the document text contains '341.15(3)' I want to find it with a
search string of '341', '341.15', or '341.15(3)', but not
'341.15(3)(b)', '341.1', or '15'.  How do I handle that?  Do I have to
build my tsquery values myself as text and cast to tsquery, or is there
something more graceful that I'm missing?

-Kevin
On Tue, 7 Apr 2009, Kevin Grittner wrote:

> If the document text contains '341.15(3)' I want to find it with a
> search string of '341', '341.15', or '341.15(3)', but not
> '341.15(3)(b)', '341.1', or '15'.  How do I handle that?  Do I have to
> build my tsquery values myself as text and cast to tsquery, or is
> there something more graceful that I'm missing?

Of course you can build the tsquery yourself, but once your parser can
recognize your very own token 'xxx', it'd be much better to have a
mapping xxx -> dict_xxx, where dict_xxx knows all the semantics.  For
example, we have our dict_regex:
http://vo.astronet.ru/arxiv/dict_regex.html

Regards,
	Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
Oleg Bartunov <oleg@sai.msu.su> wrote:
> Of course you can build the tsquery yourself, but once your parser can
> recognize your very own token 'xxx', it'd be much better to have a
> mapping xxx -> dict_xxx, where dict_xxx knows all the semantics.

I probably just need to have that "Aha!" moment, slap my forehead, and
move on; but I'm not quite understanding something.  The answer to this
question could be it:  Can I use a different set of dictionaries for
creating the tsquery than I did for the tsvector?

If so, I can have the dictionaries which generate the tsvector include
the appropriate leading tokens ('341', '341.15', '341.15(3)') and the
dictionaries for the tsquery can only generate the token based on
exactly what the user typed.  That would give me exactly what I want,
but somehow I have gotten the impression that the tsvector and tsquery
need to be generated using the same dictionary set.

I hope that's a mistaken impression?

-Kevin
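The behavior Kevin is after, shown with hand-built values (a sketch of the semantics only; in practice the lexemes would come from the custom parser and dictionaries):

-- the vector carries the cite plus its leading portions
select '341 341.15 341.15(3)'::tsvector @@ $$'341.15'$$::tsquery;        -- t
select '341 341.15 341.15(3)'::tsvector @@ $$'341.15(3)'$$::tsquery;     -- t
select '341 341.15 341.15(3)'::tsvector @@ $$'341.1'$$::tsquery;         -- f
select '341 341.15 341.15(3)'::tsvector @@ $$'341.15(3)(b)'$$::tsquery;  -- f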
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Can I use a different set of dictionaries > for creating the tsquery than I did for the tsvector? Sure, as long as the tokens (normalized words) that they produce match up for words that you want to have match. Once the tokens come out, they're just strings as far as the rest of the text search machinery is concerned. regards, tom lane
On Tue, 7 Apr 2009, Kevin Grittner wrote:

> Oleg Bartunov <oleg@sai.msu.su> wrote:
>> Of course you can build the tsquery yourself, but once your parser can
>> recognize your very own token 'xxx', it'd be much better to have a
>> mapping xxx -> dict_xxx, where dict_xxx knows all the semantics.
>
> I probably just need to have that "Aha!" moment, slap my forehead, and
> move on; but I'm not quite understanding something.  The answer to this
> question could be it:  Can I use a different set of dictionaries for
> creating the tsquery than I did for the tsvector?

Sure!  For example, you may want to index all words, so your indexing
dictionaries have no stop-word lists, while still forbidding people
from searching on common words.  Or, if you want to be able to search
for 'to be or not to be', you have to use dictionaries without stop
words.

> If so, I can have the dictionaries which generate the tsvector include
> the appropriate leading tokens ('341', '341.15', '341.15(3)') and the
> dictionaries for the tsquery can only generate the token based on
> exactly what the user typed.  That would give me exactly what I want,
> but somehow I have gotten the impression that the tsvector and tsquery
> need to be generated using the same dictionary set.
>
> I hope that's a mistaken impression?

Yes.

Regards,
	Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
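Oleg's stop-word point in miniature, with the stock configurations (every word in the phrase is an english stop word):

select to_tsvector('english', 'to be or not to be');
-- ''
select to_tsvector('simple', 'to be or not to be');
-- 'be':2,6 'not':4 'or':3 'to':1,5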
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> Can I use a different set of dictionaries
>> for creating the tsquery than I did for the tsvector?
>
> Sure, as long as the tokens (normalized words) that they produce
> match up for words that you want to have match.  Once the tokens
> come out, they're just strings as far as the rest of the text search
> machinery is concerned.

Fantastic!  Don't know how I got confused about that, but the way now
looks clear.

Thanks!

-Kevin
Oleg Bartunov <oleg@sai.msu.su> wrote:
>> I probably just need to have that "Aha!" moment, slap my forehead, and
>> move on; but I'm not quite understanding something.  The answer to this
>> question could be it:  Can I use a different set of dictionaries for
>> creating the tsquery than I did for the tsvector?
>
> Sure!  For example, you may want to index all words, so your indexing
> dictionaries have no stop-word lists, while still forbidding people
> from searching on common words.  Or, if you want to be able to search
> for 'to be or not to be', you have to use dictionaries without stop
> words.

I found a creative solution which I think meets my needs.  I'm posting
both to help out anyone with similar issues who finds the thread, and
in case someone sees an obvious defect.

By creating one function to generate the "legal" tsvector (which
recognizes statute cites) and another function to generate the search
values, with casts from text to the ts objects, I can get more targeted
results than the parser and dictionary changes alone could give me.
I'm still working on the dictionaries and the query function, but the
vector function currently looks like the attached.

Thanks to Oleg and Tom for assistance; while neither suggested quite
this solution, their comments moved me along to where I found it.

-Kevin
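The attachment itself isn't preserved here.  Purely as a hypothetical sketch of the shape such a vector function might take (the name legal_to_tsvector, the regexp, and the text-to-tsvector cast are illustrative, not Kevin's actual code; expanding each cite into its leading portions is omitted, since that job belongs to the dictionaries):

create or replace function legal_to_tsvector(doc text) returns tsvector as $f$
  -- default parse of the full text, plus each raw cite appended as its
  -- own lexeme via the unnormalized text-to-tsvector cast; the pattern
  -- is deliberately loose and also picks up bare numbers
  select to_tsvector('english', $1)
      || cast(array_to_string(array(
             select r.m[1]
             from regexp_matches($1,
                  E'(\\d+(?:[-.]\\w+)*(?:\\.?\\(\\w+\\)\\w*)*)', 'g') as r(m)
         ), ' ') as tsvector);
$f$ language sql stable;

select legal_to_tsvector('Violation of s. 341.15(3) is a nonmoving violation.');
-- includes '341.15(3)' as one lexeme alongside the english-stemmed words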