Re: zstd compression for pg_dump - Mailing list pgsql-hackers
From | Justin Pryzby |
---|---|
Subject | Re: zstd compression for pg_dump |
Date | |
Msg-id | ZBKgBz+4blcKru6x@telsasoft.com Whole thread Raw |
In response to | Re: zstd compression for pg_dump (Jacob Champion <jchampion@timescale.com>) |
Responses |
Re: zstd compression for pg_dump
Re: zstd compression for pg_dump |
List | pgsql-hackers |
On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote: > On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion <jchampion@timescale.com> wrote: > > I did some smoke testing against zstd's GitHub release on Windows. To > > build against it, I had to construct an import library, and put that > > and the DLL into the `lib` folder expected by the MSVC scripts... > > which makes me wonder if I've chosen a harder way than necessary? > > It looks like pg_dump's meson.build is missing dependencies on zstd > (meson couldn't find the headers in the subproject without them). I saw that this was added for LZ4, but I hadn't added it for zstd since I didn't run into an issue without it. Could you check that what I've added works for your case ? > > Parallel zstd dumps seem to work as expected, in that the resulting > > pg_restore output is identical to uncompressed dumps and nothing > > explodes. I haven't inspected the threading implementation for safety > > yet, as you mentioned. > > Hm. Best I can tell, the CloneArchive() machinery is supposed to be > handling safety for this, by isolating each thread's state. I don't feel > comfortable pronouncing this new addition safe or not, because I'm not > sure I understand what the comments in the format-specific _Clone() > callbacks are saying yet. My line of reasoning for unix is that pg_dump forks before any calls to zstd. Nothing zstd does ought to affect the pg_dump layer. But that doesn't apply to pg_dump under windows. This is an opened question. If there's no solid answer, I could disable/ignore the option (maybe only under windows). > On to code (not a complete review): > > > if (hasSuffix(fname, ".gz")) > > compression_spec.algorithm = PG_COMPRESSION_GZIP; > > else > > { > > bool exists; > > > > exists = (stat(path, &st) == 0); > > /* avoid unused warning if it is not built with compression */ > > if (exists) > > compression_spec.algorithm = PG_COMPRESSION_NONE; > > -#ifdef HAVE_LIBZ > > - if (!exists) > > - { > > - free_keep_errno(fname); > > - fname = psprintf("%s.gz", path); > > - exists = (stat(fname, &st) == 0); > > - > > - if (exists) > > - compression_spec.algorithm = PG_COMPRESSION_GZIP; > > - } > > -#endif > > -#ifdef USE_LZ4 > > - if (!exists) > > - { > > - free_keep_errno(fname); > > - fname = psprintf("%s.lz4", path); > > - exists = (stat(fname, &st) == 0); > > - > > - if (exists) > > - compression_spec.algorithm = PG_COMPRESSION_LZ4; > > - } > > -#endif > > + else if (check_compressed_file(path, &fname, "gz")) > > + compression_spec.algorithm = PG_COMPRESSION_GZIP; > > + else if (check_compressed_file(path, &fname, "lz4")) > > + compression_spec.algorithm = PG_COMPRESSION_LZ4; > > + else if (check_compressed_file(path, &fname, "zst")) > > + compression_spec.algorithm = PG_COMPRESSION_ZSTD; > > } > > This function lost some coherence, I think. Should there be a hasSuffix > check at the top for ".zstd" (and, for that matter, ".lz4")? The function is first checking if it was passed a filename which already has a suffix. And if not, it searches through a list of suffixes, testing for an existing file with each suffix. The search with stat() doesn't happen if it has a suffix. I'm having trouble seeing how the hasSuffix() branch isn't dead code. Another opened question. > I'm a little suspicious of the replacement of supports_compression() > with parse_compress_specification(). For example: > > > - errmsg = supports_compression(AH->compression_spec); > > - if (errmsg) > > + parse_compress_specification(AH->compression_spec.algorithm, > > + NULL, &compress_spec); > > + if (compress_spec.parse_error != NULL) > > { > > pg_log_warning("archive is compressed, but this installation does not support compression (%s > > - errmsg); > > - pg_free(errmsg); > > + compress_spec.parse_error); > > + pg_free(compress_spec.parse_error); > > } > > The top-level error here is "does not support compression", but wouldn't > a bad specification option with a supported compression method trip this > path too? No - since the 2nd argument is passed as NULL, it just checks whether the compression is supported. Maybe there ought to be a more direct/clean way to do it. But up to now evidently nobody needed to do that. > > +static void > > +ZSTD_CCtx_setParam_or_die(ZSTD_CStream *cstream, > > + ZSTD_cParameter param, int value, char *paramname) > > IMO we should avoid stepping on the ZSTD_ namespace with our own > internal function names. done > > + if (cs->readF != NULL) > > + > > + if (cs->writeF != NULL) > > This seems to suggest that both cs->readF and cs->writeF could be set, > but in that case, the output buffer gets reallocated. I put back an assertion that exactly one of them was set, since that's true of how it currently works. > I was curious about the extra byte allocated in the decompression case. > I see that ReadDataFromArchiveZstd() is null-terminating the buffer > before handing it to ahwrite(), but why does it need to do that? I was trying to figure that out, too. I think the unterminated case might be for ExecuteSqlCommandBuf(), and that may only (have) been needed to allow pg_restore to handle ancient/development versions of pg_dump... It's not currently hit. https://coverage.postgresql.org/src/bin/pg_dump/pg_backup_db.c.gcov.html#470 I found that the terminator was added for the uncompressed case was added at e8f69be05 and removed in bf9aa490d. > > +Zstd_get_error(CompressFileHandle *CFH) > > Seems like this should be using the zstderror stored in the handle? Yes - I'd already addressed that locally. > In ReadDataFromArchiveZstd(): > > > + * Read compressed data. Note that readF can resize the buffer; the > > + * new size is tracked and used for future loops. > This is pretty complex for what it's doing. I'm a little worried that we > let the reallocated buffer escape to the caller while losing track of > how big it is. I think that works today, since there's only ever one > call per handle, but any future refactoring that allowed cs->readData() > to be called more than once would subtly break this code. Note that nothing bad happens if we lose track of how big it is (well, assuming that readF doesn't *shrink* the buffer). The previous patch version didn't keep track of its new size, and the only consequence is that readF() might re-resize it again on a future iteration, even if it was already sufficiently large. When I originally wrote it (and up until that patch version), I left this as an XXX comment about reusing the resized buffer. But it seemed easy enough to fix so I did. > In ZstdWriteCommon(): > > Elsewhere, output->pos is set to zero before compressing, but here we do > it after, which I think leads to subtle differences in the function > preconditions. If that's an intentional difference, can the reason be > called out in a comment? It's not deliberate. I think it had no effect, but changed - thanks. -- Justin
Attachment
pgsql-hackers by date: