Re: libpq compression - Mailing list pgsql-hackers

From: Konstantin Knizhnik
Subject: Re: libpq compression
Msg-id: 3b1d2752-b7b5-db39-84c9-4ff024041fff@postgrespro.ru
In response to: Re: libpq compression (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
Hi, thanks for the review.

On 28.10.2020 22:27, Andres Freund wrote:

> I don't see a corresponding configure.ac change?

Shame on me - I completely forgot that configure is actually generated from configure.ac. Fixed.

>> + <varlistentry id="libpq-connect-compression" xreflabel="compression">
>> +  <term><literal>compression</literal></term>
>> +  <listitem>
>> +   <para>
>> +    Request compression of libpq traffic. Client sends to the server list of compression algorithms, supported by client library.
>> +    If server supports one of this algorithms, then it acknowledges use of this algorithm and then all libpq messages send both from client to server and
>> +    visa versa will be compressed. If server is not supporting any of the suggested algorithms, then it replies with 'n' (no compression)
>> +    message and it is up to the client whether to continue work without compression or report error.
>> +    Supported compression algorithms are chosen at configure time. Right now two libraries are supported: zlib (default) and zstd (if Postgres was
>> +    configured with --with-zstd option). In both cases streaming mode is used.
>> +   </para>
>> +  </listitem>
>> + </varlistentry>
>> +

> - there should be a reference to potential security impacts when used in
>   combination with encrypted connections

Done.

> - What does " and it is up to the client whether to continue work
>   without compression or report error" actually mean for a libpq parameter?

It cannot happen. The client requests compression from the server only if "compression=XXX" was specified in the connection string, and XXX has to be supported by the client library, otherwise the request is rejected on the client side. So the list of supported algorithms sent by the client can never be empty.

> - What is the point of the "streaming mode" reference?
There are two ways of performing compression:
- block mode, when each block is individually compressed (the compressor stores a dictionary in each compressed block);
- stream mode.

Block mode allows each page to be decompressed independently. It is good for implementing page or field compression (as pglz is used to compress TOAST values), but it is not efficient for compressing the client-server protocol stream. It seems to me to be important to explain that libpq is using stream mode and why there is no pglz compressor.

> Why are compression methods identified by one byte identifiers? That
> seems unnecessarily small, given this is commonly a once-per-connection
> action?

It is mostly for simplicity of implementation: it is always simpler to work with fixed-size messages (or with an array of chars rather than an array of strings). And I do not think that it decreases flexibility: these one-letter algorithm codes are not visible to the user. And I do not think that we will ever support more than 127 (or even 64) different compression algorithms.

> The protocol sounds to me like there's no way to enable/disable
> compression in an existing connection. To me it seems better to have an
> explicit, client initiated, request to use a specific method of
> compression (including none). That allows to enable compression for bulk
> work, and disable it in other cases (e.g. for security sensitive
> content, or for unlikely to compress well content).

It would significantly complicate the implementation (because of buffering at different levels). Also, it is not clear to me who would control enabling/disabling compression in this case, and how. I can imagine that "\copy" should trigger compression, but what about a (server side) "copy" command? Or just a select returning huge results? I do not think that making the user or application enable/disable compression on the fly is really a good idea: the overhead of compressing small commands is not so large.

And concerning security risks...
In most cases this problem is not relevant at all, because both client and server are located within a single trusted network. If security of the communication really matters, you should not switch on compression at all (including for COPY and other bulk data transfer). Letting the client decide which data is "security sensitive" and which is not seems like a very strange idea to me.

> I think that would also make cross-version handling easier, because a
> newer client driver can send the compression request and handle the
> error, without needing to reconnect or such.
>
> Most importantly, I think such a design is basically a necessity to make
> connection poolers to work in a sensible way.

I do not completely understand the problem with connection poolers. Right now the developers of Yandex Odyssey are trying to support libpq compression in their pooler. If they run into problems, I will definitely address them.

> And lastly, wouldn't it be reasonable to allow to specify things like
> compression levels? All that doesn't have to be supported now, but I
> think the protocol should take that into account.

Well, if we want to provide maximal flexibility, then we should allow specifying the compression level. When I implemented our CFS compressed storage for pgpro-ee and libpq_compression, I ran a lot of benchmarks comparing different compression algorithms with different compression levels. I certainly do not pretend this is rigorous research, but IMHO the default (fastest) compression level is always the preferable choice: it provides the best compromise between speed and compression ratio. Higher compression levels significantly (several times) reduce compression speed, while their influence on the compression ratio is much smaller. Moreover, zstd with the default compression level compresses synthetic data (i.e. strings filled with spaces generated by pgbench) much better (compression ratio 63!) than with higher levels.
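To illustrate what such a knob might look like if it were added, here is a purely hypothetical sketch of parsing an algorithm name with an optional level from a connection option value like "zstd:5". Both the syntax and the helper name are my invention for illustration, not part of the patch:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical: split "algorithm[:level]" into its parts.
 * Returns 0 on success, -1 on malformed input.
 * default_level mirrors the idea that the fastest level is the default. */
static int
parse_compression_option(const char *value, char *algorithm, size_t algo_size,
                         int *level, int default_level)
{
    const char *colon = strchr(value, ':');
    size_t      name_len = colon ? (size_t) (colon - value) : strlen(value);

    if (name_len == 0 || name_len >= algo_size)
        return -1;
    memcpy(algorithm, value, name_len);
    algorithm[name_len] = '\0';

    if (colon)
    {
        char   *end;
        long    l = strtol(colon + 1, &end, 10);

        if (end == colon + 1 || *end != '\0')
            return -1;          /* no digits, or trailing junk */
        *level = (int) l;
    }
    else
        *level = default_level; /* no level given: use the fastest one */
    return 0;
}
```

Even with such a parser, the open questions above remain: whether the level applies to the client side, the server side, or both.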
Right now in Postgres we do not allow the user to specify the compression level for TOAST data compression, nor for WAL compression,... And IMHO for libpq protocol compression the possibility to specify a compression level is even less useful. But if you think that it is so important, I will try to implement it. Many questions arise in this case: which side should control the compression level? Should the client set the compression level both at the client side and at the server side? Or should it be possible to specify the compression level separately for client and server?

>> +<para>
>> + Used compression algorithm. Right now the following streaming compression algorithms are supported: 'f' - Facebook zstd, 'z' - zlib, 'n' - no compression.
>> +</para>

> I would prefer this just be referenced as zstd or zstandard, not
> facebook zstd. There's an RFC (albeit only "informational"), and it
> doesn't name facebook, except as an employer:
> https://tools.ietf.org/html/rfc8478

Please notice that this is an internal encoding; the user will specify

    psql -d "dbname=postgres compression=zstd"

If the name "zstd" is not good, I can choose any other.

>> +int
>> +pq_configure(Port* port)
>> +{
>> +    char* client_compression_algorithms = port->compression_algorithms;
>> +    /*
>> +     * If client request compression, it sends list of supported compression algorithms.
>> +     * Each compression algorirthm is idetified by one letter ('f' - Facebook zsts, 'z' - xlib)
>> +     */

> s/algorirthm/algorithm/
> s/idetified/identified/
> s/zsts/zstd/
> s/xlib/zlib/
>
> That's, uh, quite the typo density.

Sorry, fixed.

>> +    if (client_compression_algorithms)
>> +    {
>> +        char server_compression_algorithms[ZPQ_MAX_ALGORITHMS];
>> +        char compression_algorithm = ZPQ_NO_COMPRESSION;
>> +        char compression[6] = {'z',0,0,0,5,0}; /* message length = 5 */
>> +        int rc;

> Why is this hand-rolling protocol messages?

Sorry, I do not quite understand your concern.
It seems to me that all libpq message manipulation is more or less hand-rolling, isn't it (we are not using protobuf or msgpack)? Or do you think that calling pq_sendbyte, pq_sendint32, ... would be much safer in this case?

>> +        /* Intersect lists */
>> +        while (*client_compression_algorithms != '\0')
>> +        {
>> +            if (strchr(server_compression_algorithms, *client_compression_algorithms))
>> +            {
>> +                compression_algorithm = *client_compression_algorithms;
>> +                break;
>> +            }
>> +            client_compression_algorithms += 1;
>> +        }

> Why isn't this is handled within zpq?

Because it is part of the libpq client-server handshake protocol. It seems to me that zpq is a lower-level component which is just told which compressor to use.

>> +        /* Send 'z' message to the client with selectde comression algorithm ('n' if match is ont found) */

> s/selectde/selected/
> s/comression/compression/
> s/ont/not/
> :(

Fixed. But it looks like you are inspecting not the latest patch (libpq_compression-20.patch), because two of these three typos I have already fixed.

>> +        socket_set_nonblocking(false);
>> +        while ((rc = secure_write(MyProcPort, compression, sizeof(compression))) < 0
>> +               && errno == EINTR);
>> +        if ((size_t)rc != sizeof(compression))
>> +            return -1;

> Huh? This all seems like an abstraction violation.

>> +        /* initialize compression */
>> +        if (zpq_set_algorithm(compression_algorithm))
>> +            PqStream = zpq_create((zpq_tx_func)secure_write, (zpq_rx_func)secure_read, MyProcPort);
>> +    }
>> +    return 0;
>> +}

> Why is zpq a wrapper around secure_write/read? I'm a bit worried this
> will reduce the other places we could use zpq.

zpq has to read/write data from the underlying stream, and it has to be usable both in the client and in the server environment. I did not see another way to provide a single zpq implementation without code duplication except passing rx/tx functions to it.
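For illustration, the "intersect lists" logic quoted above can be expressed as a small standalone helper. This is only a sketch - the function name is mine, while the one-letter codes and the 'n' ("no compression") fallback follow the surrounding discussion:

```c
#include <string.h>

/* Pick the first algorithm from the client's list that also appears in
 * the server's list; each algorithm is a single letter ('f' - zstd,
 * 'z' - zlib). Returns 'n' ("no compression") if there is no match,
 * which is what the server would answer in its 'z' message. */
static char
choose_compression_algorithm(const char *client_algorithms,
                             const char *server_algorithms)
{
    for (; *client_algorithms != '\0'; client_algorithms++)
    {
        if (strchr(server_algorithms, *client_algorithms) != NULL)
            return *client_algorithms;
    }
    return 'n';
}
```

Note that the client's order wins: the first client-preferred algorithm the server knows is chosen.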
>> static int
>> -pq_recvbuf(void)
>> +pq_recvbuf(bool nowait)
>> {
>> +    /* If srteaming compression is enabled then use correpondent comression read function. */

> s/srteaming/streaming/
> s/correpondent/correponding/
> s/comression/compression/
>
> Could you please try to proof-read the patch a bit? The typo density
> is quite high.

Once again, sorry. Will do.

>> +    r = PqStream
>> +        ? zpq_read(PqStream, PqRecvBuffer + PqRecvLength,
>> +                   PQ_RECV_BUFFER_SIZE - PqRecvLength, &processed)
>> +        : secure_read(MyProcPort, PqRecvBuffer + PqRecvLength,
>> +                      PQ_RECV_BUFFER_SIZE - PqRecvLength);
>> +    PqRecvLength += processed;

> ? : doesn't make sense to me in this case. This should be an if/else.

Isn't that a matter of style preference? Why is if/else in principle better than ?: here? I agree that sometimes ?: leads to complex and obscure expressions, but do you think if/else in this case would lead to clearer or more readable code? Another question is whether a conditional expression here is really a good idea at all: I would prefer to replace it with an indirect function call, but several reasons (at least the difference in function signatures) make me prefer this straightforward, if not nice, way.

>> if (r < 0)
>> {
>> +    if (r == ZPQ_DECOMPRESS_ERROR)
>> +    {
>> +        char const* msg = zpq_error(PqStream);
>> +        if (msg == NULL)
>> +            msg = "end of stream";
>> +        ereport(COMMERROR,
>> +                (errcode_for_socket_access(),
>> +                 errmsg("failed to decompress data: %s", msg)));
>> +        return EOF;
>> +    }

> I don't think we should error out with "failed to decompress data:"
> e.g. when the client closed the connection.

Sorry, but this error is reported only when ZPQ_DECOMPRESS_ERROR is returned. It means that the received data cannot be decompressed, not that the client connection is broken.
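To make the rx/tx callback indirection discussed above concrete, here is a minimal self-contained sketch of a zpq-like stream wrapper. It uses a pass-through "codec" over an in-memory channel instead of a real compressor, and all names (ZStream, zs_read, zs_write, MemChannel) are mine, not the patch's:

```c
#include <stddef.h>
#include <string.h>

typedef long (*zs_rx_func)(void *arg, void *buf, size_t size);
typedef long (*zs_tx_func)(void *arg, const void *buf, size_t size);

typedef struct
{
    zs_tx_func  tx;     /* injected write: e.g. secure_write on the server */
    zs_rx_func  rx;     /* injected read: e.g. secure_read on the server */
    void       *arg;    /* passed through to the callbacks (e.g. the Port) */
} ZStream;

/* "Compress" and send: a real implementation would run the data through
 * zstd/zlib first; here we just forward it through the injected callback. */
static long
zs_write(ZStream *zs, const void *buf, size_t size)
{
    return zs->tx(zs->arg, buf, size);
}

/* Receive and "decompress": again, just forwards to the injected callback. */
static long
zs_read(ZStream *zs, void *buf, size_t size)
{
    return zs->rx(zs->arg, buf, size);
}

/* A trivial in-memory "socket" to exercise the wrapper from both sides. */
typedef struct
{
    char    data[256];
    size_t  len;
    size_t  pos;
} MemChannel;

static long
mem_tx(void *arg, const void *buf, size_t size)
{
    MemChannel *ch = arg;

    if (ch->len + size > sizeof(ch->data))
        return -1;
    memcpy(ch->data + ch->len, buf, size);
    ch->len += size;
    return (long) size;
}

static long
mem_rx(void *arg, void *buf, size_t size)
{
    MemChannel *ch = arg;
    size_t      avail = ch->len - ch->pos;

    if (size > avail)
        size = avail;
    memcpy(buf, ch->data + ch->pos, size);
    ch->pos += size;
    return (long) size;
}
```

The point of the indirection is that the same ZStream code works whether the callbacks are the server's secure_read/secure_write or the client's equivalents, without duplicating the compression logic.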
>> @@ -1413,13 +1457,18 @@ internal_flush(void)
>>      char *bufptr = PqSendBuffer + PqSendStart;
>>      char *bufend = PqSendBuffer + PqSendPointer;
>>
>> -    while (bufptr < bufend)
>> +    while (bufptr < bufend || zpq_buffered(PqStream) != 0) /* has more data to flush or unsent data in internal compression buffer */
>>      {

> Overly long line.

Fixed.

> This should at least specify how these functions are supposed to handle
> blocking/nonblocking sockets.

Blocking/nonblocking control is done by the upper layer. The zpq functions just call the underlying IO functions and do not care whether these calls are blocking or nonblocking.

>> +
>> +#define ZSTD_BUFFER_SIZE (8*1024)
>> +#define ZSTD_COMPRESSION_LEVEL 1

> Add some arguments for choosing these parameters.

What is the suggested way to specify them? I cannot put them in GUCs (because they are also needed at the client side). Maybe it should be possible for the client to specify them in the connection string:

    psql -d "compression='zstd/level=10/buffer=8k'"

But that seems awful and overkill, doesn't it?

>> +
>> +/*
>> + * Array with all supported compression algorithms.
>> + */
>> +static ZpqAlgorithm const zpq_algorithms[] =
>> +{
>> +#if HAVE_LIBZSTD
>> +    {zstd_name, zstd_create, zstd_read, zstd_write, zstd_free, zstd_error, zstd_buffered},
>> +#endif
>> +#if HAVE_LIBZ
>> +    {zlib_name, zlib_create, zlib_read, zlib_write, zlib_free, zlib_error, zlib_buffered},
>> +#endif
>> +    {NULL}
>> +};

> I think it's preferrable to use designated initializers.
>
> Do we really need zero terminated lists? Works fine, but brrr.

Once again - a matter of taste:) Zero-terminated lists are standard C practice IMHO - not invented by me and widely used in Postgres code;)

>> +/*
>> + * Index of used compression algorithm in zpq_algorithms array.
>> + */
>> +static int zpq_algorithm_impl;

> This is just odd API design imo. Why doesn't the dispatch work based on
> an argument for zpq_create() and the ZpqStream * for the rest?
>
> What if there's two libpq connections in one process?
> To servers supporting different compression algorithms? This isn't
> going to fly.

Fixed.

>> +/*
>> + * Get list of the supported algorithms.
>> + * Each algorithm is identified by one letter: 'f' - Facebook zstd, 'z' - zlib.
>> + * Algorithm identifies are appended to the provided buffer and terminated by '\0'.
>> + */
>> +void
>> +zpq_get_supported_algorithms(char algorithms[ZPQ_MAX_ALGORITHMS])
>> +{
>> +    int i;
>> +    for (i = 0; zpq_algorithms[i].name != NULL; i++)
>> +    {
>> +        Assert(i < ZPQ_MAX_ALGORITHMS);
>> +        algorithms[i] = zpq_algorithms[i].name();
>> +    }
>> +    Assert(i < ZPQ_MAX_ALGORITHMS);
>> +    algorithms[i] = '\0';
>> +}

> Uh, doesn't this bake ZPQ_MAX_ALGORITHMS into the ABI? That seems
> entirely unnecessary?

I tried to avoid dynamic memory allocation, because zpq is used both in client and server environments with different memory allocation policies.

>> @@ -2180,6 +2257,20 @@ build_startup_packet(const PGconn *conn, char *packet,
>>      ADD_STARTUP_OPTION("replication", conn->replication);
>>      if (conn->pgoptions && conn->pgoptions[0])
>>          ADD_STARTUP_OPTION("options", conn->pgoptions);
>> +    if (conn->compression && conn->compression[0])
>> +    {
>> +        bool enabled;
>> +        /*
>> +         * If compressoin is enabled, then send to the server list of compression algorithms
>> +         * supported by client
>> +         */

> s/compressoin/compression/

Fixed.

>> +        if (parse_bool(conn->compression, &enabled))
>> +        {
>> +            char compression_algorithms[ZPQ_MAX_ALGORITHMS];
>> +            zpq_get_supported_algorithms(compression_algorithms);
>> +            ADD_STARTUP_OPTION("compression", compression_algorithms);
>> +        }
>> +    }

> I think this needs to work in a graceful manner across server
> versions. You can make that work with an argument, using the _pq_
> parameter stuff, but as I said earlier, I think it's a mistake to deal
> with this in the startup packet anyway.

Sorry, but I think that it should be quite easy for the user to toggle compression.
Originally I suggested adding a -z option to psql and other Postgres utilities working with the libpq protocol. Adding new options was considered a bad idea by the reviewer, so I left the corresponding option in the connection string:

    psql -d "dbname=postgres compression=zlib"

It is IMHO less convenient than "psql -z postgres". And I am afraid that using the _pq_ parameter stuff would make enabling compression even less user friendly. Still, I can switch to the _pq_ convention if there is consensus that adding "compression" to the startup packet should be avoided.

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company