Re: Add LZ4 compression in pg_dump - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: Add LZ4 compression in pg_dump |
Date | |
Msg-id | 3da84c90-7e12-75ee-a582-025bc68dd098@enterprisedb.com Whole thread Raw |
In response to | Re: Add LZ4 compression in pg_dump (gkokolatos@pm.me) |
Responses |
Re: Add LZ4 compression in pg_dump
|
List | pgsql-hackers |
On 3/11/23 11:50, gkokolatos@pm.me wrote: > ------- Original Message ------- > On Saturday, March 11th, 2023 at 7:00 AM, Alexander Lakhin <exclusion@gmail.com> wrote: > >> Hello, >> 23.02.2023 23:24, Tomas Vondra wrote: >> >>> On 2/23/23 16:26, Tomas Vondra wrote: >>> >>>> Thanks for v30 with the updated commit messages. I've pushed 0001 after >>>> fixing a comment typo and removing (I think) an unnecessary change in an >>>> error message. >>>> >>>> I'll give the buildfarm a bit of time before pushing 0002 and 0003. >>> >>> I've now pushed 0002 and 0003, after minor tweaks (a couple typos etc.), >>> and marked the CF entry as committed. Thanks for the patch! >>> >>> I wonder how difficult would it be to add the zstd compression, so that >>> we don't have the annoying "unsupported" cases. >> >> >> With the patch 0003 committed, a single warning -Wtype-limits appeared in the >> master branch: >> $ CPPFLAGS="-Og -Wtype-limits" ./configure --with-lz4 -q && make -s -j8 >> compress_lz4.c: In function ‘LZ4File_gets’: >> compress_lz4.c:492:19: warning: comparison of unsigned expression in ‘< 0’ is >> always false [-Wtype-limits] >> 492 | if (dsize < 0) >> | > > Thank you Alexander. Please find attached an attempt to address it. > >> (I wonder, is it accidental that there no other places that triggers >> the warning, or some buildfarm animals had this check enabled before?) > > I can not answer about the buildfarms. Do you think that adding an explicit > check for this warning in meson would help? I am a bit uncertain as I think > that type-limits are included in extra. > > @@ -1748,6 +1748,7 @@ common_warning_flags = [ > '-Wshadow=compatible-local', > # This was included in -Wall/-Wformat in older GCC versions > '-Wformat-security', > + '-Wtype-limits', > ] > >> >> It is not a false positive as can be proved by the 002_pg_dump.pl modified as >> follows: >> - program => $ENV{'LZ4'}, >> >> + program => 'mv', >> >> args => [ >> >> - '-z', '-f', '--rm', >> "$tempdir/compression_lz4_dir/blobs.toc", >> "$tempdir/compression_lz4_dir/blobs.toc.lz4", >> ], >> }, > > Correct, it is not a false positive. The existing testing framework provides > limited support for exercising error branches. Especially so when those are > dependent on generated output. > >> A diagnostic logging added shows: >> LZ4File_gets() after LZ4File_read_internal; dsize: 18446744073709551615 >> >> and pg_restore fails with: >> error: invalid line in large object TOC file >> ".../src/bin/pg_dump/tmp_check/tmp_test_22ri/compression_lz4_dir/blobs.toc": "????" > > It is a good thing that the restore fails with bad input. Yet it should > have failed earlier. The attached makes certain it does fail earlier. > Thanks for the patch. I did look if there are other places that might have the same issue, and I think there are - see attached 0002. For example LZ4File_write is declared to return size_t, but then it also does if (LZ4F_isError(status)) { fs->errcode = status; return -1; } That won't work :-( And these issues may not be restricted to lz4 code - Gzip_write is declared to return size_t, but it does return gzwrite(gzfp, ptr, size); and gzwrite returns int. Although, maybe that's correct, because gzwrite() is "0 on error" so maybe this is fine ... However, Gzip_read assigns gzread() to size_t, and that does not seem great. It probably will still trigger the following pg_fatal() because it'd be very lucky to match the expected 'size', but it's confusing. I wonder whether CompressorState should use int or size_t for the read_func/write_func callbacks. I guess no option is perfect, i.e. no data type will work for all compression libraries we might use (lz4 uses int while lz4f uses size_t, to there's that). It's a bit weird the "open" functions return int and the read/write size_t. Maybe we should stick to int, which is what the old functions (cfwrite etc.) did. But I think the actual problem here is that the API does not clearly define how errors are communicated. I mean, it's nice to return the value returned by the library function without "mangling" it by conversion to size_t, but what if the libraries communicate errors in different way? Some may return "0" while others may return "-1". I think the right approach is to handle all library errors and not just let them through. So Gzip_write() needs to check the return value, and either call pg_fatal() or translate it to an error defined by the API. For example we might say "returns 0 on error" and then translate all library-specific errors to that. While looking at the code I realized a couple function comments don't say what's returned in case of error, etc. So 0004 adds those. 0003 is a couple minor assorted comments/questions: - Should we move ZLIB_OUT_SIZE/ZLIB_IN_SIZE to compress_gzip.c? - Why are LZ4 buffer sizes different (ZLIB has both 4kB)? - I wonder if we actually need LZ4F_HEADER_SIZE_MAX? Is it even possible for LZ4F_compressBound to return value this small (especially for 16kB input buffer)? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
pgsql-hackers by date: