Thread: Multilang text search. Is this correct?
In my table (mix of text and jsonb columns ) I have text in multiple languages.
I’d like search in all the possible regconfigs, so I’ve come up with the following recipe:
CREATE FUNCTION to_tsvector_multilang(text) RETURNS tsvector AS
$$
SELECT to_tsvector('english', $1) ||
to_tsvector('greek', $1) ||
to_tsvector('simple', $1)
$$ LANGUAGE sql IMMUTABLE;
CREATE FUNCTION jsonb_to_tsvector_multilang(jsonb, jsonb) RETURNS tsvector AS
$$
SELECT jsonb_to_tsvector('english', $1, $2) ||
jsonb_to_tsvector('simple', $1, $2) ||
jsonb_to_tsvector('greek', $1, $2)
$$ LANGUAGE sql IMMUTABLE;
CREATE FUNCTION to_tsquery_multilang(query text) RETURNS tsquery AS
$$
SELECT websearch_to_tsquery('english', query) ||
websearch_to_tsquery('simple', query) ||
websearch_to_tsquery('greek', query)
$$ LANGUAGE sql IMMUTABLE;
Thus, in searching I use to_tsvector_multilang(’TEXT') @@ to_tsquery_multilang(‘QUERY’);
Looks like it’s working as expected from my prelim tests, but anyone seeing any potential pitfalls?
Note: I’m using GIN & RUM indices as well.
Florents Tselai <florents.tselai@gmail.com> writes: > In my table (mix of text and jsonb columns ) I have text in multiple languages. > I’d like search in all the possible regconfigs, so I’ve come up with the following recipe: > CREATE FUNCTION to_tsvector_multilang(text) RETURNS tsvector AS > $$ > SELECT to_tsvector('english', $1) || > to_tsvector('greek', $1) || > to_tsvector('simple', $1) > $$ LANGUAGE sql IMMUTABLE; > Looks like it’s working as expected from my prelim tests, but anyone seeing any potential pitfalls? Well, it'll work for some value of "work", but your tsvectors and associated indexes will be bloated, and you're likely to get false matches. (For example, common words like "the" will get indexed and treated as matchable unless they're stopwords in all three configurations.) The usual compromise is to use just one config --- either 'simple', or a specific language if you have a preponderance of data in that language --- and tolerate the fact that searches won't be too bright about things like reducing stemmed words. regards, tom lane