Re: Unaccent performance - Mailing list pgsql-hackers
| From | Thom Brown |
|---|---|
| Subject | Re: Unaccent performance |
| Date | |
| Msg-id | CAA-aLv7MJCp-Lmn7WCW8LrQonvaNb-6jQqjxdSxbmJYzKuqb+w@mail.gmail.com |
| In response to | Unaccent performance (Thom Brown <thom@linux.com>) |
| Responses | Re: Unaccent performance |
| List | pgsql-hackers |
On 21 June 2013 19:04, Thom Brown <thom@linux.com> wrote:
Hi,

The unaccent extension is great, especially with its customisability, but it's not always easy to recommend. I witnessed a customer using no fewer than 56 nested replace functions in an SQL function. I looked to see how much this could be mitigated by unaccent. It turns out that not all the characters they were replacing can be replaced by unaccent, either because they replace more than 1 character at a time, or because the character they're replacing, for some reason, isn't processed by unaccent, even with a custom rules file.

So there were 20 characters I could identify that they were replacing. I made a custom rules file and compared its performance to the difficult-to-manage set of nested replace calls.

CREATE OR REPLACE FUNCTION public.myunaccent(sometext text)
 RETURNS text
 LANGUAGE sql
 IMMUTABLE
AS $function$
SELECT replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(sometext,'ą','a'),'Ą','A'),'ă','a'),'Ă','A'),'ā','a'),'Ā','A'),'æ','a'),'å','a'),'ä','a'),'ã','a'),'â','a'),'á','a'),'à','a'),'Æ','A'),'Å','A'),'Ä','A'),'Ã','A'),'Â','A'),'Á','A'),'À','A');
$function$

postgres=# SELECT myunaccent(sometext::text) FROM (SELECT 'ÀÁÂÃÄÅÆàáâãäåæĀāĂ㥹' sometext FROM generate_series(1,1000000)) x OFFSET 999999 LIMIT 1;
      myunaccent
----------------------
 AAAAAAAaaaaaaaAaAaAa
(1 row)

Time: 726.282 ms

postgres=# SELECT unaccent(sometext::text) FROM (SELECT 'ÀÁÂÃÄÅÆàáâãäåæĀāĂ㥹' sometext FROM generate_series(1,1000000)) x OFFSET 999999 LIMIT 1;
       unaccent
----------------------
 AAAAAAAaaaaaaaAaAaAa
(1 row)

Time: 3305.252 ms

The timings are actually pretty much the same even if I introduce 187 nested replace calls, one for every line in the unaccent.rules file for 187 characters. But the same character set with unaccent increases to 7418.526 ms with the same type of query as above. That's 10 times more expensive.

Is there a way to boost the performance to make its adoption more palatable?
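For reference, a custom rules file is wired into unaccent through a text search dictionary built on the unaccent template; roughly like this, where 'my_rules' is a hypothetical rules file placed at $SHAREDIR/tsearch_data/my_rules.rules:

CREATE EXTENSION IF NOT EXISTS unaccent;

-- 'my_rules' is a hypothetical name; it refers to
-- $SHAREDIR/tsearch_data/my_rules.rules on the server
CREATE TEXT SEARCH DICTIONARY my_unaccent (
    TEMPLATE = unaccent,
    RULES = 'my_rules'
);

-- unaccent() accepts an explicit dictionary as its first argument
SELECT unaccent('my_unaccent', 'ÀÁÂ');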
Another test passing in a string of 100000 characters gives the following timings:
unaccent: 240619.395 ms
myunaccent: 785.505 ms
I guess this must indicate that unaccent is processing all rows, while myunaccent is only being run on the single selected row? I can't otherwise account for myunaccent always taking almost the same time regardless of string length. So this is probably an incorrect assessment of performance.
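One way to rule that out (a sketch, not from the original thread) is to force the function to run for every generated row by aggregating over the result instead of relying on OFFSET/LIMIT:

-- count(...) forces evaluation for all 1,000,000 rows in both cases
SELECT count(myunaccent(sometext))
FROM (SELECT 'ÀÁÂÃÄÅÆàáâãäåæĀāĂ㥹'::text AS sometext
      FROM generate_series(1, 1000000)) x;

SELECT count(unaccent(sometext))
FROM (SELECT 'ÀÁÂÃÄÅÆàáâãäåæĀāĂ㥹'::text AS sometext
      FROM generate_series(1, 1000000)) x;

If the inlined SQL function was previously being skipped for the discarded rows, this form should make the two timings directly comparable.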
Another test inserting long text strings into a text column of a table 100,000 times, then updating another column to have that unaccented value using both methods:
unaccent: 3867.306 ms
myunaccent: 43611.732 ms
So I guess this complaint about performance is all just noise.
However, pushing that pointless complaint to one side, I would like unaccent to support more characters than it currently seems to, such as bullet points, ellipses etc., and also replacements of more than 1 character. Naturally these aren't appropriate to fall under the unaccent function itself, but the rules file is a good starting point. It would be a bit like translate, except it would use a rules file instead of strings of single characters to convert.
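For illustration, the unaccent rules file is a simple two-column, whitespace-separated mapping of source character to replacement; entries like the following sketch what extending it beyond accent-stripping might look like (whether a given server version accepts multi-character entries is an assumption here, not a statement about any release):

À	A
Á	A
…	...
•	*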
So say we wanted "(trademark)" to be converted into "™" just as an example, or ";" to ".". We can't do that with unaccent, but in order to avoid a huge list of replace functions, a function like unaccent, with a few adaptations, would solve the problem.
e.g.:
SELECT transform(my_custom_dictionary, 'Commodore Amiga(trademark);')
would return
Commodore Amiga™.
This would ideally somehow cater for replacing tabs and spaces too.
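For comparison, the built-in translate() already covers the character-for-character case today, but only that case, which is exactly the gap a rules-file-driven transform() would fill:

-- character-for-character substitution works today
SELECT translate('Commodore Amiga;', ';', '.');  -- 'Commodore Amiga.'

-- but there is no rules-file-driven equivalent for multi-character
-- replacements such as '(trademark)' -> '™'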
Thom