Thread: tsearch2 dictionary for statute cites
I broached this topic last year[1], but the project got tabled until now, so I raise it again.  We want to be able to search text (extracted from character-based PDF files) which will contain legal terms and statute cites, and we want to be able to do tsearch2 searches (under a recent 8.3 release).  It's clear enough how to create a dictionary to gracefully handle the legal terms, but I'm less sure about the statute cites.

I got one response[2], which mentioned a prefix search in the 8.4 release, and provided a link to a Perl regular expression based dictionary.  I'm wondering if anyone has feedback on either of these techniques, and whether they might work for our needs.

I'm not sure I adequately described our needs, so I'll fill that out a little more.  People are likely to search for statute cites, which tend to have a hierarchical form.  I'm not sure the prefix approach will work for this.  For example, there is a section 939.64 in the state statutes dealing with commission of a crime while wearing a bulletproof garment.  If someone searches for that, they should find subsections like 939.64(1) or 939.64(2), but not different sections which start with the same characters, like 939.641 (the section on concealing identity) or 939.645 (the section on hate crimes).  A search for chapter 939 should return any of the above.

Of course, we want someone to be able to search on 939.64, 939.641, and 939.645 and get documents which reference all of the above (i.e., to look for a document referring to a hate crime committed while concealing identity and wearing a bulletproof garment).

Suggestions welcome on how to handle this user requirement.

-Kevin

[1] http://archives.postgresql.org/pgsql-admin/2008-06/msg00033.php
[2] http://archives.postgresql.org/pgsql-admin/2008-06/msg00034.php
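The pitfall Kevin describes with plain prefix matching can be seen with hand-built values and 8.4's tsquery prefix syntax (a sketch, not from the original post):

select '939.64.1'::tsvector @@ '939.64:*'::tsquery;  -- t: the subsection matches
select '939.641'::tsvector  @@ '939.64:*'::tsquery;  -- t: but so does the concealing-identity section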
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > People are likely to search for statute cites, which tend to have a > hierarchical form. I'm not sure the prefix approach will work for > this. For example, there is a section 939.64 in the state statutes > dealing with commission of a crime while wearing a bulletproof > garment. If someone searches for that, they should find subsections > like 939.64(1) or 939.64(2) but not different sections which start > with the same characters like 939.641 (the section on concealing > identity) or 939.645 (the section on hate crimes). A search for > chapter 939 should return any of the above. I think what you need is a custom parser that treats these similarly to hyphenated words. If I pretend that the dot is a hyphen I get matching behavior that seems to meet all those requirements. Unfortunately we don't seem to have any really easy way to plug in a custom parser, other than copy-paste-modify the existing one which would be a PITA from a maintenance standpoint. Perhaps you could pass the texts and the queries through a regexp substitution that converts digit-dot-digit to digit-dash-digit? regards, tom lane
On Tue, 10 Mar 2009, Tom Lane wrote:

> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> People are likely to search for statute cites, which tend to have a
>> hierarchical form.  I'm not sure the prefix approach will work for
>> this.  For example, there is a section 939.64 in the state statutes
>> dealing with commission of a crime while wearing a bulletproof
>> garment.  If someone searches for that, they should find subsections
>> like 939.64(1) or 939.64(2) but not different sections which start
>> with the same characters like 939.641 (the section on concealing
>> identity) or 939.645 (the section on hate crimes).  A search for
>> chapter 939 should return any of the above.
>
> I think what you need is a custom parser that treats these similarly to
> hyphenated words.  If I pretend that the dot is a hyphen I get matching
> behavior that seems to meet all those requirements.
>
> Unfortunately we don't seem to have any really easy way to plug in a
> custom parser, other than copy-paste-modify the existing one, which
> would be a PITA from a maintenance standpoint.  Perhaps you could pass
> the texts and the queries through a regexp substitution that converts
> digit-dot-digit to digit-dash-digit?

Perhaps, but for 8.4 it's better to utilize prefix search:
to_tsquery('939.645:*') will find what Kevin needs.  The problem is
with the parser, so I'd preprocess the text before indexing to convert
all digit.digit(digit) to digit.digit.digit, which is what the parser
recognizes as a single lexeme of type 'version'.  Here is just an
illustration:

qq=# select * from ts_parse('default',translate('939.64(1)','()','. '));
 tokid |  token
-------+----------
     8 | 939.64.1
    12 |

BTW, having 'version' it's possible to use dict_regex for 8.3.

Regards,
	Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
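Putting the preprocessing and the prefix search together (a sketch using the built-in simple configuration; not verbatim from Oleg's mail):

select to_tsvector('simple', translate('939.64(1)', '()', '. '))
       @@ to_tsquery('simple', '939.64:*');   -- t

Note the prefix still matches a lexeme like '939.641' as well, so this alone doesn't give the exact-boundary behavior Kevin wants; that part is taken up later in the thread.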
>>> Oleg Bartunov <oleg@sai.msu.su> wrote:
> On Tue, 10 Mar 2009, Tom Lane wrote:
>> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>>> People are likely to search for statute cites, which tend to have a
>>> hierarchical form.  I'm not sure the prefix approach will work for
>>> this.  For example, there is a section 939.64 in the state statutes
>>> dealing with commission of a crime while wearing a bulletproof
>>> garment.  If someone searches for that, they should find subsections
>>> like 939.64(1) or 939.64(2) but not different sections which start
>>> with the same characters like 939.641 (the section on concealing
>>> identity) or 939.645 (the section on hate crimes).  A search for
>>> chapter 939 should return any of the above.
>>
>> Perhaps you could pass the texts and the queries through a regexp
>> substitution that converts digit-dot-digit to digit-dash-digit?
>
> Perhaps, but for 8.4 it's better to utilize prefix search:
> to_tsquery('939.645:*') will find what Kevin needs.  The problem is
> with the parser, so I'd preprocess the text before indexing to convert
> all digit.digit(digit) to digit.digit.digit, which is what the parser
> recognizes as a single lexeme of type 'version'.  Here is just an
> illustration:
>
> qq=# select * from ts_parse('default',translate('939.64(1)','()','. '));
>  tokid |  token
> -------+----------
>      8 | 939.64.1
>     12 |
>
> BTW, having 'version' it's possible to use dict_regex for 8.3.

Tom, Oleg:  Thanks for the suggestions.  Looks promising.

-Kevin
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> People are likely to search for statute cites, which tend to have a
>> hierarchical form.

> I think what you need is a custom parser

I've just returned to this and after review have become convinced that
this is absolutely necessary; once the default parser has done its
work, figuring out the bounds of a statute cite would be next to
impossible.  Examples of the kind of fun you can have labeling
statutes, ordinances, and rules should you ever get elected to public
office:

10-3-350.10(1)(k)
10.1(40)(d)1
10.40.040(c)(2)
100.525(2)(a)3
105-10.G(3)(a)
11.04C.3.R.(1)
8.961.41(cm)
9.125.07(4A)(3)
947.013(1m)(a)

In any of these, a search string which exactly matches something up to
(but not including) a dash, dot, or left paren should find that thing.

> Unfortunately we don't seem to have any really easy way to plug in a
> custom parser, other than copy-paste-modify the existing one which
> would be a PITA from a maintenance standpoint.

I'm afraid I'm going to have to bite the bullet and do this anyway.
Any guidance on how to go about it may save me some time.  Also, if
there is any way to do this which may be useful to others or integrate
into PostgreSQL to reduce the long-term PITA aspect, I'm all ears.

-Kevin
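For reference, a rough regexp that picks out all nine shapes above (a sketch of my own, not from the thread; a production pattern would need tightening, since this also matches bare numbers):

select regexp_matches('see ss. 947.013(1m)(a) and 11.04C.3.R.(1)',
                      E'\\d+(?:[-.]\\w+)*(?:\\.?\\(\\w+\\)\\w*)*', 'g');
-- {947.013(1m)(a)}
-- {11.04C.3.R.(1)}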
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Perhaps you could pass the texts and the queries through a regexp
> substitution that converts digit-dot-digit to digit-dash-digit?

This doesn't seem to get me anywhere.  For cite '9.125.07(4A)(3)' I got
this:

select ts_debug('9-125-07-4A-3');
                            ts_debug
----------------------------------------------------------------
 (uint,"Unsigned integer",9,{simple},simple,{9})
 (int,"Signed integer",-125,{simple},simple,{-125})
 (int,"Signed integer",-07,{simple},simple,{-07})
 (int,"Signed integer",-4,{simple},simple,{-4})
 (asciiword,"Word, all ASCII",A,{english_stem},english_stem,{})
 (int,"Signed integer",-3,{simple},simple,{-3})
(6 rows)

Would there be a reasonable generalized way to pick something like this
out of a body of text using dictionaries and treat it as a statute
cite?

-Kevin
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> regexp substitution

I found a way to at least keep the cite in one piece.  Perhaps I can do
the rest in custom dictionaries, which are more pluggable.

select ts_debug
  ('State Statute <cite value="SS9.125.07(4A)(3)"> pertaining to');
                                    ts_debug
--------------------------------------------------------------------------------
 (asciiword,"Word, all ASCII",State,{english_stem},english_stem,{state})
 (blank,"Space symbols"," ",{},,)
 (asciiword,"Word, all ASCII",Statute,{english_stem},english_stem,{statut})
 (blank,"Space symbols"," ",{},,)
 (tag,"XML tag","<cite value=""SS9.125.07(4A)(3)"">",{},,)
 (blank,"Space symbols"," ",{},,)
 (asciiword,"Word, all ASCII",pertaining,{english_stem},english_stem,{pertain})
 (blank,"Space symbols"," ",{},,)
 (asciiword,"Word, all ASCII",to,{english_stem},english_stem,{})
(9 rows)

-Kevin
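Note that the tag row above shows empty dictionaries ({},,): tag tokens aren't mapped in the stock english configuration, so the wrapped cite needs a mapping before it appears in a tsvector at all.  A sketch (the 'legal' configuration name is mine, and the simple dictionary is only a stand-in for the custom cite dictionary):

CREATE TEXT SEARCH CONFIGURATION legal ( COPY = english );
ALTER TEXT SEARCH CONFIGURATION legal
    ADD MAPPING FOR tag WITH simple;

select to_tsvector('legal',
  'State Statute <cite value="SS9.125.07(4A)(3)"> pertaining to');
-- the whole tag now comes through as a single (lowercased) lexeme,
-- ready for a custom dictionary to unwrap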
Kevin,

contrib/test_parser - an example parser code.

On Mon, 6 Apr 2009, Kevin Grittner wrote:

> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>>> People are likely to search for statute cites, which tend to have a
>>> hierarchical form.
>
>> I think what you need is a custom parser
>
> I've just returned to this and after review have become convinced that
> this is absolutely necessary; once the default parser has done its
> work, figuring out the bounds of a statute cite would be next to
> impossible.  Examples of the kind of fun you can have labeling
> statutes, ordinances, and rules should you ever get elected to public
> office:
>
> 10-3-350.10(1)(k)
> 10.1(40)(d)1
> 10.40.040(c)(2)
> 100.525(2)(a)3
> 105-10.G(3)(a)
> 11.04C.3.R.(1)
> 8.961.41(cm)
> 9.125.07(4A)(3)
> 947.013(1m)(a)
>
> In any of these, a search string which exactly matches something up to
> (but not including) a dash, dot, or left paren should find that thing.
>
>> Unfortunately we don't seem to have any really easy way to plug in a
>> custom parser, other than copy-paste-modify the existing one which
>> would be a PITA from a maintenance standpoint.
>
> I'm afraid I'm going to have to bite the bullet and do this anyway.
> Any guidance on how to go about it may save me some time.  Also, if
> there is any way to do this which may be useful to others or integrate
> into PostgreSQL to reduce the long-term PITA aspect, I'm all ears.
>
> -Kevin

Regards,
	Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
Oleg Bartunov <oleg@sai.msu.su> wrote:
> contrib/test_parser - an example parser code.

Thanks!  Sorry I missed that.

-Kevin
Oleg Bartunov <oleg@sai.msu.su> wrote:
> contrib/test_parser - an example parser code.

Using that as a template, I seem to be on track to use the regexp.c
code to pick out statute cites from the text in my start function, and
recognize when I'm positioned on one in my getlexeme (GETTOKEN)
function, delegating everything before, between, and after statute
cites to the default parser.  (I really didn't want to copy/paste and
modify the whole default parser.)

That leaves one question I'm still pretty fuzzy on -- how do I go about
having a statute cite in a tsquery match the entire statute cite from a
tsvector, or delimited leading portions of it, without having it match
shorter portions?  For example:

If the document text contains '341.15(3)' I want to find it with a
search string of '341', '341.15', or '341.15(3)', but not
'341.15(3)(b)', '341.1', or '15'.  How do I handle that?  Do I have to
build my tsquery values myself as text and cast to tsquery, or is there
something more graceful that I'm missing?

-Kevin
On Tue, 7 Apr 2009, Kevin Grittner wrote:

> If the document text contains '341.15(3)' I want to find it with a
> search string of '341', '341.15', or '341.15(3)', but not
> '341.15(3)(b)', '341.1', or '15'.  How do I handle that?  Do I have to
> build my tsquery values myself as text and cast to tsquery, or is
> there something more graceful that I'm missing?

Of course you can build the tsquery yourself, but once your parser can
recognize your very own token 'xxx', it'd be much better to have a
mapping xxx -> dict_xxx, where dict_xxx knows all the semantics.  For
example, we have our dict_regex:
http://vo.astronet.ru/arxiv/dict_regex.html

Regards,
	Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
Oleg Bartunov <oleg@sai.msu.su> wrote:
> Of course you can build the tsquery yourself, but once your parser can
> recognize your very own token 'xxx', it'd be much better to have a
> mapping xxx -> dict_xxx, where dict_xxx knows all the semantics.

I probably just need to have that "Aha!" moment, slap my forehead, and
move on; but I'm not quite understanding something.  The answer to this
question could be it:  Can I use a different set of dictionaries for
creating the tsquery than I did for the tsvector?

If so, I can have the dictionaries which generate the tsvector include
the appropriate leading tokens ('341', '341.15', '341.15(3)') and the
dictionaries for the tsquery can only generate the token based on
exactly what the user typed.  That would give me exactly what I want,
but somehow I have gotten the impression that the tsvector and tsquery
need to be generated using the same dictionary set.

I hope that's a mistaken impression?

-Kevin
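The behavior Kevin is after, shown with hand-built values (a sketch of the semantics only; in practice the lexemes would come from the custom parser and dictionaries):

-- the vector carries the cite plus its leading portions
select '341 341.15 341.15(3)'::tsvector @@ $$'341.15'$$::tsquery;        -- t
select '341 341.15 341.15(3)'::tsvector @@ $$'341.15(3)'$$::tsquery;     -- t
select '341 341.15 341.15(3)'::tsvector @@ $$'341.1'$$::tsquery;         -- f
select '341 341.15 341.15(3)'::tsvector @@ $$'341.15(3)(b)'$$::tsquery;  -- f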
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Can I use a different set of dictionaries > for creating the tsquery than I did for the tsvector? Sure, as long as the tokens (normalized words) that they produce match up for words that you want to have match. Once the tokens come out, they're just strings as far as the rest of the text search machinery is concerned. regards, tom lane
On Tue, 7 Apr 2009, Kevin Grittner wrote:

> Oleg Bartunov <oleg@sai.msu.su> wrote:
>> Of course you can build the tsquery yourself, but once your parser can
>> recognize your very own token 'xxx', it'd be much better to have a
>> mapping xxx -> dict_xxx, where dict_xxx knows all the semantics.
>
> I probably just need to have that "Aha!" moment, slap my forehead, and
> move on; but I'm not quite understanding something.  The answer to this
> question could be it:  Can I use a different set of dictionaries for
> creating the tsquery than I did for the tsvector?

Sure!  For example, you may want to index all words, so your indexing
dictionaries have no stop-word lists, while still forbidding people
from searching on common words.  Or, if you want to be able to search
for 'to be or not to be', you have to use dictionaries without stop
words.

> If so, I can have the dictionaries which generate the tsvector include
> the appropriate leading tokens ('341', '341.15', '341.15(3)') and the
> dictionaries for the tsquery can only generate the token based on
> exactly what the user typed.  That would give me exactly what I want,
> but somehow I have gotten the impression that the tsvector and tsquery
> need to be generated using the same dictionary set.
>
> I hope that's a mistaken impression?

Yes.

Regards,
	Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
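Oleg's stop-word point in miniature, with the stock configurations (every word in the phrase is an english stop word):

select to_tsvector('english', 'to be or not to be');
-- ''
select to_tsvector('simple', 'to be or not to be');
-- 'be':2,6 'not':4 'or':3 'to':1,5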
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> Can I use a different set of dictionaries
>> for creating the tsquery than I did for the tsvector?
>
> Sure, as long as the tokens (normalized words) that they produce
> match up for words that you want to have match.  Once the tokens
> come out, they're just strings as far as the rest of the text search
> machinery is concerned.

Fantastic!  Don't know how I got confused about that, but the way now
looks clear.

Thanks!

-Kevin
Oleg Bartunov <oleg@sai.msu.su> wrote:
>> I probably just need to have that "Aha!" moment, slap my forehead, and
>> move on; but I'm not quite understanding something.  The answer to this
>> question could be it:  Can I use a different set of dictionaries for
>> creating the tsquery than I did for the tsvector?
>
> Sure!  For example, you may want to index all words, so your indexing
> dictionaries have no stop-word lists, while still forbidding people
> from searching on common words.  Or, if you want to be able to search
> for 'to be or not to be', you have to use dictionaries without stop
> words.

I found a creative solution which I think meets my needs.  I'm posting
both to help out anyone with similar issues who finds the thread, and
in case someone sees an obvious defect.

By creating one function to generate the "legal" tsvector (which
recognizes statute cites) and another function to generate the search
values, with casts from text to the ts objects, I can get more targeted
results than the parser and dictionary changes alone could give me.
I'm still working on the dictionaries and the query function, but the
vector function currently looks like the attached.

Thanks to Oleg and Tom for assistance; while neither suggested quite
this solution, their comments moved me along to where I found it.

-Kevin
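The attachment itself isn't preserved here.  Purely as a hypothetical sketch of the shape such a vector function might take (the name legal_to_tsvector, the regexp, and the text-to-tsvector cast are illustrative, not Kevin's actual code; expanding each cite into its leading portions is omitted, since that job belongs to the dictionaries):

create or replace function legal_to_tsvector(doc text) returns tsvector as $f$
  -- default parse of the full text, plus each raw cite appended as its
  -- own lexeme via the unnormalized text-to-tsvector cast; the pattern
  -- is deliberately loose and also picks up bare numbers
  select to_tsvector('english', $1)
      || cast(array_to_string(array(
             select r.m[1]
             from regexp_matches($1,
                  E'(\\d+(?:[-.]\\w+)*(?:\\.?\\(\\w+\\)\\w*)*)', 'g') as r(m)
         ), ' ') as tsvector);
$f$ language sql stable;

select legal_to_tsvector('Violation of s. 341.15(3) is a nonmoving violation.');
-- includes '341.15(3)' as one lexeme alongside the english-stemmed words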