Re: Add LZ4 compression in pg_dump - Mailing list pgsql-hackers
From | Justin Pryzby |
---|---|
Subject | Re: Add LZ4 compression in pg_dump |
Date | |
Msg-id | ZAoGO+/Besr40GLp@telsasoft.com Whole thread Raw |
In response to | Re: Add LZ4 compression in pg_dump (Tomas Vondra <tomas.vondra@enterprisedb.com>) |
Responses |
Re: Add LZ4 compression in pg_dump
|
List | pgsql-hackers |
On Wed, Mar 01, 2023 at 05:39:54PM +0100, Tomas Vondra wrote: > On 2/27/23 05:49, Justin Pryzby wrote: > > On Sat, Feb 25, 2023 at 08:05:53AM -0600, Justin Pryzby wrote: > >> On Fri, Feb 24, 2023 at 11:02:14PM -0600, Justin Pryzby wrote: > >>> I have some fixes (attached) and questions while polishing the patch for > >>> zstd compression. The fixes are small and could be integrated with the > >>> patch for zstd, but could be applied independently. > >> > >> One more - WriteDataToArchiveGzip() says: > > > > One more again. > > > > The LZ4 path is using non-streaming mode, which compresses each block > > without persistent state, giving poor compression for -Fc compared with > > -Fp. If the data is highly compressible, the difference can be orders > > of magnitude. > > > > $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -Z lz4 -Fp |wc -c > > 12351763 > > $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -Z lz4 -Fc |wc -c > > 21890708 > > > > That's not true for gzip: > > > > $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z gzip -Fc |wc -c > > 2118869 > > $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z gzip -Fp |wc -c > > 2115832 > > > > The function ought to at least use streaming mode, so each block/row > > isn't compressioned in isolation. 003 is a simple patch to use > > streaming mode, which improves the -Fc case: > > > > $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -Z lz4 -Fc |wc -c > > 15178283 > > > > However, that still flushes the compression buffer, writing a block > > header, for every row. With a single-column table, pg_dump -Fc -Z lz4 > > still outputs ~10% *more* data than with no compression at all. And > > that's for compressible data. > > > > $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Fc -Z lz4 |wc -c > > 12890296 > > $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Fc -Z none |wc -c > > 11890296 > > > > I think this should use the LZ4F API with frames, which are buffered to > > avoid outputting a header for every single row. The LZ4F format isn't > > compatible with the LZ4 format, so (unlike changing to the streaming > > API) that's not something we can change in a bugfix release. I consider > > this an Opened Item. > > > > With the LZ4F API in 004, -Fp and -Fc are essentially the same size > > (like gzip). (Oh, and the output is three times smaller, too.) > > > > $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z lz4 -Fp |wc -c > > 4155448 > > $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z lz4 -Fc |wc -c > > 4156548 > > Thanks. Those are definitely interesting improvements/optimizations! > > I suggest we track them as a separate patch series - please add them to > the CF app (I guess you'll have to add them to 2023-07 at this point, > but we can get them in, I think). Thanks for looking. I'm not sure if I'm the best person to write/submit the patch to implement that for LZ4. Georgios, would you want to take on this change ? I think that needs to be changed for v16, since 1) LZ4F works so much better like this, and 2) we can't change it later without breaking compatibility of the dumpfiles by changing the header with another name other than "lz4". Also, I imagine we'd want to continue supporting the ability to *restore* a dumpfile using the old(current) format, which would be untestable code unless we also preserved the ability to write it somehow (like -Z lz4-old). One issue is that LZ4F_createCompressionContext() and LZ4F_compressBegin() ought to be called in InitCompressorLZ4(). It seems like it might *need* to be called there to avoid exactly the kind of issue that I reported with empty LOs with gzip. But InitCompressorLZ4() isn't currently passed the ArchiveHandle, so can't write the header. And LZ4CompressorState has a simple char *buf, and not an more elaborate data structure like zlib. You could work around that by keeping storing the "len" of the existing buffer, and flushing it in EndCompressorLZ4(), but that adds needless complexity to the Write and End functions. Maybe the Init function should be passed the AH. -- Justin
pgsql-hackers by date: