Issue Supporting Emojis in Full Text Search on Ubuntu - Mailing list pgsql-novice
From | Jordan Hurwich |
---|---|
Subject | Issue Supporting Emojis in Full Text Search on Ubuntu |
Date | |
Msg-id | CAJKqsjs22F-sJVks96MO9PC9fZ-8FjK=GZOvQEvvoFVXCopbDw@mail.gmail.com |
Responses | Re: Issue Supporting Emojis in Full Text Search on Ubuntu |
List | pgsql-novice |
We rely on the Postgres tsvector implementation for full text search in our app, but on Ubuntu the parser does not recognize emoji characters (like "😀" <U+1F600>) as anything other than "blank"/"Space symbols", per ts_debug(). On Mac, the same characters are recognized as "word"/"Word, all letters". Non-English, non-emoji characters (like "我" <U+6211>) are recognized as "word" characters on both Mac and Ubuntu.
We greatly appreciate your feedback. Debug details are below, and we are happy to provide more on request.
Jordan
pulsasensors.com, jhurwich@
Platform:
- AWS Ubuntu 18.04.2 LTS vs MacOS 10.15.5
- postgres (PostgreSQL) 11.5
* ts_debug() differs on MacOS and Ubuntu *
We have not modified the 'english' text search configuration on either instance. However, the query "SELECT * FROM ts_debug('english', '😀');" returns different results on MacOS 10.15.5 and on our Ubuntu instance:
- on MacOS:
db=# select * from ts_debug('english', '😀');
alias | description | token | dictionaries | dictionary | lexemes
-------+-------------------+-------+----------------+--------------+---------
word | Word, all letters | 😀 | {english_stem} | english_stem | {😀}
- on Ubuntu:
db=# SELECT * from ts_debug('english','😀');
alias | description | token | dictionaries | dictionary | lexemes
-------+---------------+-------+--------------+------------+---------
blank | Space symbols | 😀 | {} | |
Notably, non-English, non-emoji characters like '我' behave as desired, with the same result on both MacOS and Ubuntu for "SELECT * FROM ts_debug('english', '我');":
db=# SELECT * FROM ts_debug('english', '我');
alias | description | token | dictionaries | dictionary | lexemes
-------+-------------------+-------+----------------+--------------+---------
word | Word, all letters | 我 | {english_stem} | english_stem | {我}
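One property that separates the two test characters, independent of Postgres, is their Unicode general category. The sketch below uses Python's standard unicodedata module to print it; Python is not part of the setup described here, and the helper name describe is made up for illustration. It is only an assumption that this distinction bears on the parser's behavior, and the script does not test how either platform's C library classifies these characters.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Return basic Unicode properties for a single character (hypothetical helper)."""
    category = unicodedata.category(ch)
    return {
        "codepoint": f"U+{ord(ch):04X}",
        # Two-letter Unicode general category, e.g. "Lo" or "So".
        "category": category,
        # All letter categories (Lu, Ll, Lt, Lm, Lo) start with "L".
        "is_letter": category.startswith("L"),
        # Python's own alphabetic test, based on the same Unicode data.
        "str_isalpha": ch.isalpha(),
    }


# The two characters used in the ts_debug() calls above.
for ch in ("😀", "我"):
    print(ch, describe(ch))
```

In the Unicode data, "😀" is a symbol (category So) and "我" is a letter (category Lo), so a character classifier that follows Unicode strictly would treat them differently.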
* pg_database *
There are minor differences between MacOS and Ubuntu in pg_database, shown below. However, setting datcollate and datctype on Ubuntu to 'C', or to the more specific 'en_US.UTF-8', has not changed the result of ts_debug(). See the row for 'testdb01':
- on Mac:
db=# select datname, encoding, datcollate, datctype, datistemplate from pg_database;
datname | encoding | datcollate | datctype | datistemplate
----------------+----------+------------+----------+---------------
postgres | 6 | C | C | f
template0 | 6 | C | C | t
template1 | 6 | C | C | t
testdb01 | 6 | C | C | f
- on Ubuntu:
db=# select datname, encoding, datcollate, datctype, datistemplate from pg_database;
datname | encoding | datcollate | datctype | datistemplate
-----------+----------+-------------+-------------+---------------
postgres | 6 | C.UTF-8 | C.UTF-8 | f
template0 | 6 | C.UTF-8 | C.UTF-8 | t
template1 | 6 | en_US.UTF-8 | en_US.UTF-8 | t
testdb01 | 6 | en_US.UTF-8 | en_US.UTF-8 | f
* locale *
The output of `$ locale` is similar on both instances; it is included below for Ubuntu. The output of `$ locale -a` varies considerably: MacOS returns dozens of entries while Ubuntu returns only 4, included below:
- on Ubuntu
$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
- on Ubuntu
$ locale -a
C
C.UTF-8
en_US.utf8
POSIX
* Postgres installation *
On Mac, Postgres was installed and is managed by Homebrew via the "postgresql@11" formula.
On Ubuntu, Postgres was installed from source, using https://ftp.postgresql.org/pub/source/v11.5/postgresql-11.5.tar.bz2.