Thread: client libpq multibyte support
Hi,

A client application using libpq made by non-MULTIBYTE can not talk
to server made by MULTIBYTE.

(Example)
------------------------------------------------------------
A_server(non-MULTIBYTE) B_server(--enable-multibyte=EUC_JP)
  |                     |
--+----------+----------+-- network
             |
  C_server(non-MULTIBYTE)

By using the C_server's psql(+non-MULTIBYTE-libpq),

prompt> psql -h B_server
admin=# set client_encoding='SJIS';
SET VARIABLE
admin=# \dt
       List of relations
    Name    | Type  | Owner
------------+-------+-------
 SJIS_KANJI | table | admin
(1 row)

admin=# select * from SJIS_KANJI ;
\: extra argument ';' ignored
\: extra argument ';' ignored
Invalid command \. Try \? for help.

(Here, "SJIS_KANJI" is SJIS multibyte code.)
------------------------------------------------------------

Is this a specification ?

I hope that a client 7.0-libpq and an application always be
made by "configure --enable-multibyte" even if MULTIBYTE isn't
necessary for backend. If so, the above problem will be solved.

--
Regard,
SAKAIDA Masaaki -- Osaka, Japan

SAKAIDA Masaaki <sakaida@psn.co.jp> writes:
> A client application using libpq made by non-MULTIBYTE
> can not talk to server made by MULTIBYTE.

> admin=# select * from SJIS_KANJI ;
> \: extra argument ';' ignored
> \: extra argument ';' ignored
> Invalid command \. Try \? for help.

Ugh :-(. We have not seen this reported before --- do you know exactly
where it's coming from? (I suspect it may be a psql issue not a libpq
issue, but hard to say without more info.)

> I hope that a client 7.0-libpq and an application always be
> made by "configure --enable-multibyte" even if MULTIBYTE isn't
> necessary for backend. If so, the above problem will be solved.

I do not think that will go over well with people who don't need
multibyte support, since the MULTIBYTE code is a good deal larger
and slower. Also, AFAIK we didn't have any such problem in 6.5, so
perhaps this is just a small bug not requiring such a sledgehammer
solution. We need to look more closely.

regards, tom lane

> > admin=# select * from SJIS_KANJI ;
> > \: extra argument ';' ignored
> > \: extra argument ';' ignored
> > Invalid command \. Try \? for help.
>
> Ugh :-(. We have not seen this reported before --- do you know exactly
> where it's coming from? (I suspect it may be a psql issue not a libpq
> issue, but hard to say without more info.)

That's because none-MB client does not understand how "Shift JIS
kanji" consists of letters with different width bytes. The similar
problem would happen with the Big5 character set (traditional
Chinese), also. Unlike other character sets, these should be treated
carefully since they include the same bit patterns as ASCII and that
makes none-MB clients confused.

> I do not think that will go over well with people who don't need
> multibyte support, since the MULTIBYTE code is a good deal larger
> and slower. Also, AFAIK we didn't have any such problem in 6.5, so
> perhaps this is just a small bug not requiring such a sledgehammer
> solution. We need to look more closely.

No, 6.5 (and former versions) has exactly the same "bug." The reason
why you didn't hear it by now is that just nobody had tried to mix
MB/none-MB backend/server configurations until Masaaki came up with
pgbash:-)

Anyway, I could hardly imagine that such configurations would
actually exist in the real world. Masaaki, could you tell me what are
the advantages or reasons of the configuration?

For the Tom's comment of "the MULTIBYTE code is a good deal larger and
slower": IMHO it's a price of i18n (I don't claim my implementation of
MB is the most efficient one, though). Today almost any OS and
applications are evolving to be "i18n ready." Look at Lamar's new
RPM. The multibyte and the locale functionalities are now enabled by
default in it.

In the near future, PostgreSQL would have true i18n functionalities
(NATIONAL CHARACTER and friends), and I look forward to join the
work. I hope PostgreSQL would be i18n ready by default at that time.

--
Tatsuo Ishii

> > That's because none-MB client does not understand how "Shift JIS
> > kanji" consists of letters with different width bytes. The similar
> > problem would happen with the Big5 character set (traditional
> > Chinese), also. Unlike other character sets, these should be treated
> > carefully since they include the same bit patterns as ASCII and that
> > makes none-MB clients confused.
>
> I'm confused though, this would mean that somewhere in the string
> `SJIS_KANJI' a backslash was found. But that's all ASCII characters.
> Aren't the characters 0-127 always identical in any character set?

Not always. In Shift JIS and Big5, a multibyte character can contain
bytes in the 0-127 range. So "how to distinguish them from ASCII?",
you might ask. Here are rules for this:

1. parse from the beginning byte of the string in question. If it is
   0-127 then it's an ASCII (single byte letter).

2. if it's between 0xa1 and 0xdf, it's a "1 byte kana" (single byte
   letter).

3. otherwise it's a "kanji" (double byte letter). In this case the
   second byte might be in range of 0-127 (this is the source of the
   problem).

I think Big5 has similar, but a little bit different rule (I don't
remember precisely now).

Other encodings having 0-127 range bytes (but they are not ASCII)
include:

o UCS-2, 4 (Unicode)
o any 7 bit encoded ISO 2022 based charsets. for example, ISO 2022-jp.

--
Tatsuo Ishii

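[Editor's sketch] The three Shift JIS rules listed in the message above can be written down in a few lines of C. This is only an illustration: `sjis_char_len` and `sjis_char_count` are hypothetical helper names, not libpq functions, and the code does no validation of lead or trail bytes beyond what the rules say.

```c
#include <assert.h>
#include <stddef.h>

/* Byte length of the Shift JIS character starting at s[0], following
 * the three rules from the message. Hypothetical helper, not libpq. */
int sjis_char_len(const unsigned char *s)
{
    if (s[0] < 0x80)
        return 1;               /* rule 1: ASCII, single byte */
    if (s[0] >= 0xa1 && s[0] <= 0xdf)
        return 1;               /* rule 2: "1 byte kana" */
    return 2;                   /* rule 3: double byte; 2nd byte may be 0-127 */
}

/* Count characters (not bytes) in a NUL-terminated Shift JIS string. */
size_t sjis_char_count(const unsigned char *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (*s != '\0') {
        int len = sjis_char_len(s);

        if (len == 2 && s[1] == '\0')
            len = 1;            /* truncated input: do not step past the NUL */
        s += len;
        n++;
    }
    return n;
}
```

For example, the katakana "so" is encoded in Shift JIS as the two bytes 0x83 0x5c. `sjis_char_len` reports 2 for it, so the trailing 0x5c is consumed as part of the character instead of being read as an ASCII backslash.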
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> For the Tom's comment of "the MULTIBYTE code is a good deal larger and
> slower": IMHO it's a price of i18n (I don't claim my implementation of
> MB is the most efficient one, though). Today almost any OS and
> applications are evolving to be "i18n ready."

True, and in fact most of the performance problem in the client-side
MULTIBYTE code comes from the fact that it's not designed-in, but tries
to be a minimally intrusive patch. I think we could make it go faster
if we accepted that it was standard functionality. So I'm not averse to
going in that direction in the long term ... but I do object to turning
on MULTIBYTE by default just a couple days before release. We don't
really know how robust the MULTIBYTE-client-and-non-MULTIBYTE-server
combination is, and so I'm afraid to make it the default configuration
with hardly any testing.

regards, tom lane

> True, and in fact most of the performance problem in the client-side
> MULTIBYTE code comes from the fact that it's not designed-in, but tries
> to be a minimally intrusive patch. I think we could make it go faster
> if we accepted that it was standard functionality. So I'm not averse to
> going in that direction in the long term ...

Glad to hear that.

> but I do object to turning
> on MULTIBYTE by default just a couple days before release. We don't
> really know how robust the MULTIBYTE-client-and-non-MULTIBYTE-server
> combination is, and so I'm afraid to make it the default configuration
> with hardly any testing.

Agreed.

--
Tatsuo Ishii

Tatsuo Ishii <t-ishii@sra.co.jp> wrote:

> > > admin=# select * from SJIS_KANJI ;
> > > \: extra argument ';' ignored
> > > \: extra argument ';' ignored
> > > Invalid command \. Try \? for help.
> >
(snip)
>
> That's because none-MB client does not understand how "Shift JIS
> kanji" consists of letters with different width bytes. The similar
> problem would happen with the Big5 character set (traditional
> Chinese), also. Unlike other character sets, these should be treated
> carefully since they include the same bit patterns as ASCII and that
> makes none-MB clients confused.

Thank you for your reply. (Probably, the direct cause of this error
is PQmblen(). The non-MULTIBYTE PQmblen() always returns "1".)

> Anyway, I could hardly imagine that such configurations
> would actually exist in the real world. Masaaki, could you tell me
> what are the advantages or reasons of the configuration?

# My poor English won't be able to explain the real world ;-).

If a client libpq always be made by "configure --enable-multibyte",
the advantages are

1. In the case of SQL_ASCII, a client application speed is almost
   equal to non-MULTIBYTE. And the MULTIBYTE code is not so larger.

2. When required, by using "set client_encoding=xxx", it is possible
   to use the MULTIBYTE at anytime.

--
Regard,
SAKAIDA Masaaki -- Osaka, Japan

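[Editor's sketch] The suspected cause above (a length function that always returns 1) can be illustrated with a scanner that looks for a backslash and advances by whatever the length function reports. Everything here is an assumption made for illustration: `mblen_always_one`, `mblen_sjis` and `find_backslash` are hypothetical names, and this is not how psql's actual parser is written.

```c
#include <assert.h>
#include <stddef.h>

/* "How many bytes does the character at s occupy?" Both versions
 * below are illustrative stand-ins, not libpq code. */
typedef int (*mblen_fn)(const unsigned char *s);

/* What a non-MULTIBYTE build effectively does: one byte per character. */
int mblen_always_one(const unsigned char *s)
{
    (void) s;
    return 1;
}

/* Shift JIS aware length, per the rules quoted earlier in the thread. */
int mblen_sjis(const unsigned char *s)
{
    if (s[0] < 0x80 || (s[0] >= 0xa1 && s[0] <= 0xdf))
        return 1;
    return s[1] != '\0' ? 2 : 1;    /* never step past the terminator */
}

/* Byte offset of the first *character* that is a backslash, or -1.
 * Only the first byte of each character is inspected. */
long find_backslash(const unsigned char *s, mblen_fn mblen)
{
    size_t i = 0;

    assert(s != NULL && mblen != NULL);
    while (s[i] != '\0') {
        if (s[i] == '\\')
            return (long) i;
        i += (size_t) mblen(s + i);
    }
    return -1;
}
```

With the input `select * from ` followed by the Shift JIS bytes 0x83 0x5c and a semicolon, the always-one scanner reports a backslash at byte offset 15, which a client would then treat as the start of a backslash command. The Shift JIS aware scanner skips both bytes as one character and reports no backslash.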
> > True, and in fact most of the performance problem in the client-side
> > MULTIBYTE code comes from the fact that it's not designed-in, but tries
> > to be a minimally intrusive patch. I think we could make it go faster
> > if we accepted that it was standard functionality. So I'm not averse to
> > going in that direction in the long term ...
>
> Glad to hear that.
>
> > but I do object to turning
> > on MULTIBYTE by default just a couple days before release. We don't
> > really know how robust the MULTIBYTE-client-and-non-MULTIBYTE-server
> > combination is, and so I'm afraid to make it the default configuration
> > with hardly any testing.
>
> Agreed.

Thank you for your challenge. I expect that a good result comes out.

--
Regard,
SAKAIDA Masaaki -- Osaka, Japan

Please allow me to pick out this thread again.

> > True, and in fact most of the performance problem in the client-side
> > MULTIBYTE code comes from the fact that it's not designed-in, but tries
> > to be a minimally intrusive patch. I think we could make it go faster
> > if we accepted that it was standard functionality. So I'm not averse to
> > going in that direction in the long term ...

I have checked the performance problem.

(Environment)
- Hardware : P200pro CPU, 128MB, 5400rpm disk
- OS : Red hat Linux-5.2
- Database version: postgresql-7.0RC1

(Tested software and data)
- Library : libpq
- Program : ecpg application program, psql
- SQL : insert, select
- Number of tuples : 100,000 tuples

(Test case)
(1) non-MULTIBYTE
(2) MULTIBYTE encoding=SQL_ASCII

An ecpg program and the psql were used in this test case.

(Result)
As for the result, there was no difference in the speed of (1) and
(2). I could *not* find the performance problem.

(Improvement)
However, the performance problem may occur if the test of 10,000,000
tuples will be done, because PQmblen() has a little overhead of
routine-call. Therefore, if the MULTIBYTE PQmblen() will be changed as
the following, the performance problem disappears *perfectly*.

    #ifdef MULTIBYTE
    int
    PQmblen(const unsigned char *s, int encoding)
    {
        if (encoding == SQL_ASCII)
            return 1;           /* <======= Added line */
        return (pg_encoding_mblen(encoding, s));
    }
    #endif

(Conclusion)
A client library/application should be made by
"configure --enable-multibyte[=SQL_ASCII]" when postgresql is made by
"configure [non-MULTIBYTE]".

(Reference of library size)
                non-MULTIBYTE   MULTIBYTE
libpq.a             69KB          91KB
libpq.so.2.0        52KB          52KB
libpq.so.2.1        60KB          78KB

--
Regard,
SAKAIDA Masaaki -- Osaka, Japan
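
[Editor's sketch] The fast-path idea proposed above can be tried outside of libpq with stand-ins for the internal pieces. All names here are hypothetical: `demo_PQmblen` mirrors the shape of the proposed function, `demo_encoding_mblen` stands in for `pg_encoding_mblen()`, and the `DEMO_*` encoding ids are made up for this example rather than taken from PostgreSQL's headers. EUC_JP is used as the non-ASCII case only because it is the server encoding in the original report.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical encoding ids for this sketch only. */
enum demo_encoding { DEMO_SQL_ASCII = 0, DEMO_EUC_JP = 1 };

/* Stand-in for pg_encoding_mblen(): the general per-encoding path. */
int demo_encoding_mblen(int encoding, const unsigned char *s)
{
    if (encoding == DEMO_EUC_JP) {
        if (s[0] == 0x8e)
            return 2;           /* SS2 prefix: half-width kana */
        if (s[0] == 0x8f)
            return 3;           /* SS3 prefix: three-byte character */
        if (s[0] >= 0xa1 && s[0] <= 0xfe)
            return 2;           /* two-byte character */
    }
    return 1;                   /* ASCII byte or unknown encoding */
}

/* Same shape as the proposed PQmblen(): answer SQL_ASCII immediately
 * and only call the general routine for real multibyte encodings. */
int demo_PQmblen(const unsigned char *s, int encoding)
{
    assert(s != NULL);
    if (encoding == DEMO_SQL_ASCII)
        return 1;               /* fast path: no further call */
    return demo_encoding_mblen(encoding, s);
}
```

Whether the saved function call is measurable depends on the compiler and workload; the measurements reported above found no difference at 100,000 tuples even without the fast path.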