Tsearch vs Snowball, or what's a source file? - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Tsearch vs Snowball, or what's a source file? |
Date | |
Msg-id | 26653.1180814748@sss.pgh.pa.us |
Responses |
Re: Tsearch vs Snowball, or what's a source file?
|
List | pgsql-hackers |
While looking at the tsearch-in-core patch I was distressed to notice
that a good fraction of it is derived files, bearing notices such as

	/* This file was generated automatically by the Snowball to ANSI C compiler */

Our normal policy is "no derived files in CVS", so I went looking to see
if we couldn't avoid that.  I now see that contrib/tsearch2 has been
doing the same thing for a while, and it's risen up to bite us before, eg
http://archives.postgresql.org/pgsql-committers/2005-09/msg00137.php

I had not previously known anything about Snowball, but after perusing
their website http://snowball.tartarus.org/ for a bit, I believe the
following is an accurate summary:

1. The original word-stemming algorithms are written in a special
   language "Snowball".  You can get both the Snowball compiler and the
   original ".sbl" source files off the Snowball site, but these files
   are not those.

2. The Snowball people also distribute a "pre-compiled" version of
   their stuff, ie, the results of generating ANSI C code from all the
   stemming algorithms.  They call this distribution "libstemmer".

3. What we've been distributing in contrib/tsearch2/snowball is a
   severely cut-back subset of libstemmer, ie, just the English and
   Russian stemmers.  This accounts for the occasional complaints in the
   mailing lists from people who were trying to add other stemmers from
   the libstemmer distribution (and running into version-skew problems,
   because the version we're using is not very up-to-date).

4. The proposed tsearch-in-core patch includes a larger subset of
   libstemmer, but it's still not the whole thing, and it still seems to
   be a modified copy rather than an exact one.

There isn't any part of this that seems to me to be a good idea.
Arguably we should be relying on the original .sbl files, but that would
make the Snowball compiler a required tool for building distributions,
which is a dependency I for one don't want to add.
In any case there's probably not a lot of practical difference between
relying on the Snowball project's .sbl files and relying on their
libstemmer distribution.  Either way, we are importing someone else's
sources.  (At least they're BSD-license sources...)

What I definitely *don't* like is that we've whacked the fileset around
in ways that make it hard for someone to drop in a newer version of the
upstream sources.  The filenames don't match, the directory layout
doesn't match, and to add insult to injury we've plastered our copyright
on their files.

Following the precedent of the zic timezone files would suggest dropping
an *unmodified* copy of the libstemmer distro into its own subdirectory
of our CVS, and doing whatever we have to do to compile it without any
changes, so that we can drop in updates later without creating problems.
(This is, in fact, what the Snowball people recommend for incorporating
their code into a larger application.)

OTOH, keeping our copy of the zic files up-to-date has proven to be a
significant pain in the neck, and so I'm not sure I care to follow that
precedent exactly.  The Snowball files may not change as often as
politicians invent new timezone laws, but they seem to change regularly
enough --- the libstemmer tarball I just downloaded from their website
seems to have been generated barely a week ago, and no it doesn't match
what's in the patch now.

Is there a reasonable way to treat libstemmer as an external library?

			regards, tom lane