Re: Parallel copy - Mailing list pgsql-hackers
From: Kuntal Ghosh
Subject: Re: Parallel copy
Msg-id: CAGz5QCLcwA17t0yVWF9nasohWaxPJdM9Xvm_jWSazu8yBAbw8A@mail.gmail.com
In response to: Re: Parallel copy (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
Hello,

I was going through some literature on parsing CSV files in a fully parallelized way and found (from [1]) an interesting approach implemented in the open-source project ParaText [2]. The algorithm follows a two-phase approach: the first pass identifies the adjusted chunks in parallel by exploiting the simplicity of the CSV format, and the second pass processes the complete records within each adjusted chunk in one of the available workers. Here is the sketch:

1. Each worker scans a distinct fixed-size chunk of the CSV file and collects the following three stats from the chunk:
   a) the number of quotes
   b) the position of the first newline after an even number of quotes
   c) the position of the first newline after an odd number of quotes

2. Once the stats from all the chunks are collected, the leader identifies the adjusted chunk boundaries by iterating over the stats linearly:
   - For the k-th chunk, the leader adds up the number of quotes in the preceding k-1 chunks.
   - If that number is even, the k-th chunk does not start in the middle of a quoted field, and the first newline after an even number of quotes (the second collected stat) is the first record delimiter in this chunk.
   - Otherwise, if the number is odd, the chunk starts inside a quoted field, and the first newline after an odd number of quotes (the third collected stat) is the first record delimiter.
   - The end position of the adjusted chunk is obtained from the starting position of the next adjusted chunk.

3. Once the boundaries of the chunks are determined (forming the adjusted chunks), each worker can take up one adjusted chunk and process its tuples independently. (A small standalone C sketch of steps 1 and 2 is appended at the end of this mail.)

Although this approach parses the CSV in parallel, it requires two scans of the CSV file. So, on a system with a spinning hard disk and little RAM, as per my understanding, the algorithm will perform very poorly. But, if we use this algorithm to parse a CSV file on a multi-core system with a large RAM, the performance might improve significantly [1].

Hence, I was trying to think about whether we can leverage this idea for implementing parallel COPY in PG. We can design an algorithm similar to parallel hash join, where the workers pass through different phases.

1. Phase 1 - Read fixed-size chunks in parallel and store the chunks, along with the small per-chunk stats, in shared memory. If the shared memory is full, go to phase 2.
2. Phase 2 - Allow a single worker to process the stats and decide the actual chunk boundaries so that no tuple spans two different chunks. Go to phase 3.
3. Phase 3 - Each worker picks one adjusted chunk and parses and processes the tuples from it. Once done with one chunk, it picks the next one, and so on.
4. If there is still some unread content, go back to phase 1.

We can probably use separate workers for phase 1 and phase 3 so that they can work concurrently.

Advantages:
1. Each worker spends a significant amount of time in each phase, so it gets the benefit of the instruction cache - at least in phase 1.
2. It has the same advantage as parallel hash join - fast workers get to do more work.
3. We can extend this solution to reading data from STDIN. Of course, phase 1 and phase 2 must then be performed by the leader process, which is the one that can read from the socket.

Disadvantages:
1. It surely doesn't work if we don't have enough shared memory.
2. Possibly, this approach is just impractical for PG due to certain limitations.

Thoughts?

[1] https://www.microsoft.com/en-us/research/uploads/prod/2019/04/chunker-sigmod19.pdf
[2] ParaText. https://github.com/wiseio/paratext
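PS: To make the boundary logic a bit more concrete, here is a minimal standalone C sketch of steps 1 and 2 above. It is my own illustration, not taken from ParaText or from any PG code, and it ignores details such as alternative quote/escape characters, multi-byte encodings and CR/LF line endings - it only tracks quote parity, which is all the boundary step needs:

/* chunk_bounds.c -- standalone demo, compile with: cc -o chunk_bounds chunk_bounds.c */
#include <stdio.h>
#include <string.h>

typedef struct ChunkStats
{
    long    nquotes;        /* number of quote characters in the chunk */
    long    first_nl_even;  /* offset of the first '\n' seen after an even
                             * number of in-chunk quotes, or -1 if none */
    long    first_nl_odd;   /* same, but after an odd number of quotes */
} ChunkStats;

/* Pass 1: each worker collects these stats for its raw, fixed-size chunk. */
static ChunkStats
collect_stats(const char *buf, long len)
{
    ChunkStats  st = {0, -1, -1};
    long        i;

    for (i = 0; i < len; i++)
    {
        if (buf[i] == '"')
            st.nquotes++;
        else if (buf[i] == '\n')
        {
            if (st.nquotes % 2 == 0 && st.first_nl_even < 0)
                st.first_nl_even = i;
            else if (st.nquotes % 2 == 1 && st.first_nl_odd < 0)
                st.first_nl_odd = i;
        }
    }
    return st;
}

/*
 * Pass 2: the leader walks the stats linearly.  For chunk k, the parity of
 * the quotes seen in chunks 0..k-1 tells whether the chunk starts outside
 * (even) or inside (odd) a quoted field; the matching newline position then
 * gives the first record delimiter in that chunk, and the first record of
 * the adjusted chunk starts right after it.  A value of -1 means the chunk
 * contains no record delimiter at all (a record spans past it).
 */
static void
adjust_boundaries(const ChunkStats *stats, int nchunks, long chunk_size,
                  long *first_record_start)
{
    long    quotes_before = 0;
    int     k;

    for (k = 0; k < nchunks; k++)
    {
        long    nl = (quotes_before % 2 == 0) ?
            stats[k].first_nl_even : stats[k].first_nl_odd;

        first_record_start[k] = (nl < 0) ? -1 : (long) k * chunk_size + nl + 1;
        quotes_before += stats[k].nquotes;
    }
}

int
main(void)
{
    /* a quoted field containing an embedded newline spans chunks 0 and 1 */
    const char *csv = "a,\"b\nstill b\",c\nd,e,f\ng,h,i\n";
    long        chunk_size = 10;
    long        len = (long) strlen(csv);
    int         nchunks = (int) ((len + chunk_size - 1) / chunk_size);
    ChunkStats  stats[8];
    long        starts[8];
    int         k;

    for (k = 0; k < nchunks; k++)
    {
        long    off = (long) k * chunk_size;
        long    n = (off + chunk_size <= len) ? chunk_size : len - off;

        stats[k] = collect_stats(csv + off, n);
    }

    adjust_boundaries(stats, nchunks, chunk_size, starts);

    for (k = 0; k < nchunks; k++)
        printf("chunk %d: first record starts at file offset %ld\n",
               k, starts[k]);
    return 0;
}

For this sample input the quoted field spans chunks 0 and 1, so chunk 0 reports no record delimiter (-1), chunk 1's first record starts at offset 16 and chunk 2's at offset 22; the adjusted chunk for chunk 0 simply starts at offset 0 of the file.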
-- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com