Thread: client libpq multibyte support

client libpq multibyte support

From

SAKAIDA Masaaki

Date:

04 May 2000, 22:34:18

Hi,
   A client application using a libpq built without MULTIBYTE 
cannot talk to a server built with MULTIBYTE.

(Example)
------------------------------------------------------------
  A_server(non-MULTIBYTE)      B_server(--enable-multibyte=EUC_JP)
         |                     |
       --+----------+----------+-- network
                    |
                C_server(non-MULTIBYTE)

By using the C_server's psql (+ non-MULTIBYTE libpq),

prompt> psql -h B_server
admin=# set client_encoding='SJIS';
SET VARIABLE
admin=# \dt
   List of relations
    Name    | Type  | Owner
------------+-------+-------
 SJIS_KANJI | table | admin
(1 row)

admin=# select * from SJIS_KANJI ;
\: extra argument ';' ignored
\: extra argument ';' ignored
Invalid command \. Try \? for help.

(Here, "SJIS_KANJI" is SJIS multibyte code.)
-----------------------------------------------------------
 Is this the specified behavior?
 I hope that the 7.0 client libpq and applications will always be 
built with "configure --enable-multibyte" even if MULTIBYTE isn't 
necessary for the backend. If so, the above problem will be solved.

--
Regard,
SAKAIDA Masaaki -- Osaka, Japan

Re: client libpq multibyte support

From

Tom Lane

Date:

04 May 2000, 23:42:03

SAKAIDA Masaaki <sakaida@psn.co.jp> writes:
>     A client application using a libpq built without MULTIBYTE 
> cannot talk to a server built with MULTIBYTE.

>  admin=# select * from SJIS_KANJI ;
>  \: extra argument ';' ignored
>  \: extra argument ';' ignored
>  Invalid command \. Try \? for help.        

Ugh :-(.  We have not seen this reported before --- do you know exactly
where it's coming from?  (I suspect it may be a psql issue not a libpq
issue, but hard to say without more info.)

>   I hope that the 7.0 client libpq and applications will always be 
> built with "configure --enable-multibyte" even if MULTIBYTE isn't 
> necessary for the backend. If so, the above problem will be solved.

I do not think that will go over well with people who don't need
multibyte support, since the MULTIBYTE code is a good deal larger
and slower.  Also, AFAIK we didn't have any such problem in 6.5, so
perhaps this is just a small bug not requiring such a sledgehammer
solution.  We need to look more closely.
        regards, tom lane

Re: client libpq multibyte support

From

Tatsuo Ishii

Date:

05 May 2000, 04:13:08

> >  admin=# select * from SJIS_KANJI ;
> >  \: extra argument ';' ignored
> >  \: extra argument ';' ignored
> >  Invalid command \. Try \? for help.        
> 
> Ugh :-(.  We have not seen this reported before --- do you know exactly
> where it's coming from?  (I suspect it may be a psql issue not a libpq
> issue, but hard to say without more info.)

That's because a non-MB client does not understand that "Shift JIS
kanji" text consists of letters with different byte widths. A similar
problem would happen with the Big5 character set (traditional
Chinese) as well. Unlike other character sets, these must be treated
carefully since their multibyte letters can include the same bit
patterns as ASCII, and that confuses non-MB clients.

> I do not think that will go over well with people who don't need
> multibyte support, since the MULTIBYTE code is a good deal larger
> and slower.  Also, AFAIK we didn't have any such problem in 6.5, so
> perhaps this is just a small bug not requiring such a sledgehammer
> solution.  We need to look more closely.

No, 6.5 (and earlier versions) has exactly the same "bug." The reason
you haven't heard of it until now is just that nobody had tried mixed
MB/non-MB backend/server configurations until Masaaki came up with
pgbash:-) Anyway, I could hardly imagine that such configurations
would actually exist in the real world. Masaaki, could you tell me
what are the advantages or reasons of the configuration?

As for Tom's comment that "the MULTIBYTE code is a good deal larger and
slower": IMHO it's the price of i18n (I don't claim my implementation of
MB is the most efficient one, though). Today almost every OS and
application is evolving to be "i18n ready." Look at Lamar's new RPM. 
The multibyte and locale functionalities are now enabled by
default in it.

In the near future, PostgreSQL would have true i18n functionalities
(NATIONAL CHARACTER and friends), and I look forward to joining the work. 
I hope PostgreSQL would be i18n ready by default at that time.
--
Tatsuo Ishii

Re: client libpq multibyte support

From

Tatsuo Ishii

Date:

05 May 2000, 06:37:08

> > That's because a non-MB client does not understand that "Shift JIS
> > kanji" text consists of letters with different byte widths. A similar
> > problem would happen with the Big5 character set (traditional
> > Chinese) as well. Unlike other character sets, these must be treated
> > carefully since their multibyte letters can include the same bit
> > patterns as ASCII, and that confuses non-MB clients.
> 
> I'm confused though, this would mean that somewhere in the string
> `SJIS_KANJI' a backslash was found. But that's all ASCII characters.
> Aren't the characters 0-127 always identical in any character set?

Not always. Shift JIS and Big5 use bytes in the 0-127 range as part of
multibyte letters. So "how to distinguish them from ASCII?", you might
ask. Here are the rules for Shift JIS:

1. Parse from the beginning byte of the string in question. If it is
0-127, then it's ASCII (a single byte letter).

2. if it's between 0xa1 and 0xdf, it's a "1 byte kana" (single byte
letter).

3. otherwise it's a "kanji" (double byte letter). In this case the
second byte might be in range of 0-127 (this is the source of the
problem).

I think Big5 has a similar but slightly different rule (I don't
remember precisely now).

Other encodings having 0-127 range bytes (but they are not ASCII)
include:

o UCS-2, 4 (Unicode)

o any 7 bit encoded ISO 2022 based charsets. for example, ISO 2022-jp.
--
Tatsuo Ishii

Re: client libpq multibyte support

From

Tom Lane

Date:

05 May 2000, 10:35:16

Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> As for Tom's comment that "the MULTIBYTE code is a good deal larger and
> slower": IMHO it's the price of i18n (I don't claim my implementation of
> MB is the most efficient one, though). Today almost every OS and
> application is evolving to be "i18n ready."

True, and in fact most of the performance problem in the client-side
MULTIBYTE code comes from the fact that it's not designed-in, but tries
to be a minimally intrusive patch.  I think we could make it go faster
if we accepted that it was standard functionality.  So I'm not averse to
going in that direction in the long term ... but I do object to turning
on MULTIBYTE by default just a couple days before release.  We don't
really know how robust the MULTIBYTE-client-and-non-MULTIBYTE-server
combination is, and so I'm afraid to make it the default configuration
with hardly any testing.
        regards, tom lane

Re: client libpq multibyte support

From

Tatsuo Ishii

Date:

05 May 2000, 11:18:11

> True, and in fact most of the performance problem in the client-side
> MULTIBYTE code comes from the fact that it's not designed-in, but tries
> to be a minimally intrusive patch.  I think we could make it go faster
> if we accepted that it was standard functionality.  So I'm not averse to
> going in that direction in the long term ... 

Glad to hear that.

> but I do object to turning
> on MULTIBYTE by default just a couple days before release.  We don't
> really know how robust the MULTIBYTE-client-and-non-MULTIBYTE-server
> combination is, and so I'm afraid to make it the default configuration
> with hardly any testing.

Agreed.
--
Tatsuo Ishii

Re: client libpq multibyte support

From

SAKAIDA Masaaki

Date:

05 May 2000, 11:57:13

Tatsuo Ishii <t-ishii@sra.co.jp> wrote:

> > >  admin=# select * from SJIS_KANJI ;
> > >  \: extra argument ';' ignored
> > >  \: extra argument ';' ignored
> > >  Invalid command \. Try \? for help.        
> >
(snip)
> 
> That's because a non-MB client does not understand that "Shift JIS
> kanji" text consists of letters with different byte widths. A similar
> problem would happen with the Big5 character set (traditional
> Chinese) as well. Unlike other character sets, these must be treated
> carefully since their multibyte letters can include the same bit
> patterns as ASCII, and that confuses non-MB clients.
 Thank you for your reply.

(Probably the direct cause of this error is PQmblen(): the non-MULTIBYTE PQmblen() always returns 1.)

>             Anyway, I could hardly imagine that such configurations
> would actually exist in the real world.  Masaaki, could you tell me
> what are the advantages or reasons of the configuration?

# My poor English won't be able to explain the real world ;-).
 If the client libpq is always built with "configure --enable-multibyte",
 the advantages are:
 1. In the case of SQL_ASCII, client application speed is almost equal
    to non-MULTIBYTE, and the MULTIBYTE code is not much larger.
 2. When required, it is possible to use MULTIBYTE at any time by using
    "set client_encoding=xxx".

--
Regard,
SAKAIDA Masaaki -- Osaka, Japan

Re: client libpq multibyte support

From

SAKAIDA Masaaki

Date:

05 May 2000, 12:08:12

> > True, and in fact most of the performance problem in the client-side
> > MULTIBYTE code comes from the fact that it's not designed-in, but tries
> > to be a minimally intrusive patch.  I think we could make it go faster
> > if we accepted that it was standard functionality.  So I'm not averse to
> > going in that direction in the long term ... 
> 
> Glad to hear that.
> 
> > but I do object to turning
> > on MULTIBYTE by default just a couple days before release.  We don't
> > really know how robust the MULTIBYTE-client-and-non-MULTIBYTE-server
> > combination is, and so I'm afraid to make it the default configuration
> > with hardly any testing.
> 
> Agreed.
 Thank you for your challenge. I expect that a good result comes out.

--
Regard,
SAKAIDA Masaaki -- Osaka, Japan

Re: client libpq multibyte support

From

SAKAIDA Masaaki

Date:

07 May 2000, 11:02:22

Please allow me to pick up this thread again.

> > True, and in fact most of the performance problem in the client-side
> > MULTIBYTE code comes from the fact that it's not designed-in, but tries
> > to be a minimally intrusive patch.  I think we could make it go faster
> > if we accepted that it was standard functionality.  So I'm not averse to
> > going in that direction in the long term ... 
 I have checked the performance problem.

(Environment)
-  Hardware         : P200pro CPU, 128MB, 5400rpm disk
-  OS               : Red Hat Linux-5.2
-  Database version : postgresql-7.0RC1

(Tested software and data)
-  Library          : libpq
-  Program          : ecpg application program, psql
-  SQL              : insert, select
-  Number of tuples : 100,000 tuples

(Test case)
 (1) non-MULTIBYTE
 (2) MULTIBYTE encoding=SQL_ASCII
 An ecpg program and psql were used in this test case. 

(Result)
 There was no difference in speed between (1) and (2).
 I could *not* find the performance problem.

(Improvement)
 However, a performance problem might appear if a test with 
10,000,000 tuples is run, because PQmblen() has a small 
function-call overhead. Therefore, if the MULTIBYTE PQmblen() 
is changed as follows, the performance problem disappears 
*perfectly*.
 # ifdef MULTIBYTE
 int
 PQmblen(const unsigned char *s, int encoding)
 {
       if (encoding == SQL_ASCII) return 1;    /* <======= Added line */
       return (pg_encoding_mblen(encoding, s));
 }
 # endif

(Conclusion)
 A client library/application should be built with "configure 
--enable-multibyte[=SQL_ASCII]" even when postgresql itself is built 
with "configure [non-MULTIBYTE]".


(Reference of library size)
                non-MULTIBYTE  MULTIBYTE
libpq.a             69KB         91KB
libpq.so.2.0        52KB         52KB
libpq.so.2.1        60KB         78KB

--
Regard,
SAKAIDA Masaaki -- Osaka, Japan