Re: Parallel copy - Mailing list pgsql-hackers
From: Kuntal Ghosh
Subject: Re: Parallel copy
Msg-id: CAGz5QCLcwA17t0yVWF9nasohWaxPJdM9Xvm_jWSazu8yBAbw8A@mail.gmail.com
In response to: Re: Parallel copy (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
Hello,

I was going through some literature on parsing CSV files in a fully parallelized way and found (from [1]) an interesting approach implemented in the open-source project ParaText [2]. The algorithm follows a two-phase approach: the first pass identifies the adjusted chunks in parallel by exploiting the simplicity of the CSV format, and the second pass processes the complete records within each adjusted chunk in one of the available workers. Here is the sketch:

1. Each worker scans a distinct fixed-size chunk of the CSV file and collects the following three stats from the chunk:
   a) the number of quotes
   b) the position of the first newline after an even number of quotes
   c) the position of the first newline after an odd number of quotes

2. Once the stats from all the chunks are collected, the leader identifies the adjusted chunk boundaries by iterating over the stats linearly:
   - For the k-th chunk, the leader adds up the number of quotes in the preceding k-1 chunks.
   - If that number is even, the k-th chunk does not start in the middle of a quoted field, and the first newline after an even number of quotes (the second collected stat) is the first record delimiter in this chunk.
   - Otherwise, if the number is odd, the chunk starts inside a quoted field, and the first newline after an odd number of quotes (the third collected stat) is the first record delimiter.
   - The end position of the adjusted chunk is obtained from the starting position of the next adjusted chunk.

3. Once the boundaries of the chunks are determined (forming the adjusted chunks), each worker can take up one adjusted chunk and process its tuples independently. (A small standalone C sketch of steps 1 and 2 is appended at the end of this mail.)

Although this approach parses the CSV in parallel, it requires two scans of the CSV file. So, on a system with a spinning hard disk and little RAM, as per my understanding, the algorithm will perform very poorly. But, if we use this algorithm to parse a CSV file on a multi-core system with a large RAM, the performance might improve significantly [1].

Hence, I was trying to think about whether we can leverage this idea for implementing parallel COPY in PG. We can design an algorithm similar to parallel hash join, where the workers pass through different phases.

1. Phase 1 - Read fixed-size chunks in parallel and store the chunks, along with the small per-chunk stats, in shared memory. If the shared memory is full, go to phase 2.
2. Phase 2 - Allow a single worker to process the stats and decide the actual chunk boundaries so that no tuple spans two different chunks. Go to phase 3.
3. Phase 3 - Each worker picks one adjusted chunk and parses and processes the tuples from it. Once done with one chunk, it picks the next one, and so on.
4. If there is still some unread content, go back to phase 1.

We can probably use separate workers for phase 1 and phase 3 so that they can work concurrently.

Advantages:
1. Each worker spends a significant amount of time in each phase, so it gets the benefit of the instruction cache - at least in phase 1.
2. It has the same advantage as parallel hash join - fast workers get to do more work.
3. We can extend this solution to reading data from STDIN. Of course, phase 1 and phase 2 must then be performed by the leader process, which is the one that can read from the socket.

Disadvantages:
1. It surely doesn't work if we don't have enough shared memory.
2. Possibly, this approach is just impractical for PG due to certain limitations.

Thoughts?

[1] https://www.microsoft.com/en-us/research/uploads/prod/2019/04/chunker-sigmod19.pdf
[2] ParaText. https://github.com/wiseio/paratext
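PS: To make the boundary logic a bit more concrete, here is a minimal standalone C sketch of steps 1 and 2 above. It is my own illustration, not taken from ParaText or from any PG code, and it ignores details such as alternative quote/escape characters, multi-byte encodings and CR/LF line endings - it only tracks quote parity, which is all the boundary step needs:

/* chunk_bounds.c -- standalone demo, compile with: cc -o chunk_bounds chunk_bounds.c */
#include <stdio.h>
#include <string.h>

typedef struct ChunkStats
{
    long    nquotes;        /* number of quote characters in the chunk */
    long    first_nl_even;  /* offset of the first '\n' seen after an even
                             * number of in-chunk quotes, or -1 if none */
    long    first_nl_odd;   /* same, but after an odd number of quotes */
} ChunkStats;

/* Pass 1: each worker collects these stats for its raw, fixed-size chunk. */
static ChunkStats
collect_stats(const char *buf, long len)
{
    ChunkStats  st = {0, -1, -1};
    long        i;

    for (i = 0; i < len; i++)
    {
        if (buf[i] == '"')
            st.nquotes++;
        else if (buf[i] == '\n')
        {
            if (st.nquotes % 2 == 0 && st.first_nl_even < 0)
                st.first_nl_even = i;
            else if (st.nquotes % 2 == 1 && st.first_nl_odd < 0)
                st.first_nl_odd = i;
        }
    }
    return st;
}

/*
 * Pass 2: the leader walks the stats linearly.  For chunk k, the parity of
 * the quotes seen in chunks 0..k-1 tells whether the chunk starts outside
 * (even) or inside (odd) a quoted field; the matching newline position then
 * gives the first record delimiter in that chunk, and the first record of
 * the adjusted chunk starts right after it.  A value of -1 means the chunk
 * contains no record delimiter at all (a record spans past it).
 */
static void
adjust_boundaries(const ChunkStats *stats, int nchunks, long chunk_size,
                  long *first_record_start)
{
    long    quotes_before = 0;
    int     k;

    for (k = 0; k < nchunks; k++)
    {
        long    nl = (quotes_before % 2 == 0) ?
            stats[k].first_nl_even : stats[k].first_nl_odd;

        first_record_start[k] = (nl < 0) ? -1 : (long) k * chunk_size + nl + 1;
        quotes_before += stats[k].nquotes;
    }
}

int
main(void)
{
    /* a quoted field containing an embedded newline spans chunks 0 and 1 */
    const char *csv = "a,\"b\nstill b\",c\nd,e,f\ng,h,i\n";
    long        chunk_size = 10;
    long        len = (long) strlen(csv);
    int         nchunks = (int) ((len + chunk_size - 1) / chunk_size);
    ChunkStats  stats[8];
    long        starts[8];
    int         k;

    for (k = 0; k < nchunks; k++)
    {
        long    off = (long) k * chunk_size;
        long    n = (off + chunk_size <= len) ? chunk_size : len - off;

        stats[k] = collect_stats(csv + off, n);
    }

    adjust_boundaries(stats, nchunks, chunk_size, starts);

    for (k = 0; k < nchunks; k++)
        printf("chunk %d: first record starts at file offset %ld\n",
               k, starts[k]);
    return 0;
}

For this sample input the quoted field spans chunks 0 and 1, so chunk 0 reports no record delimiter (-1), chunk 1's first record starts at offset 16 and chunk 2's at offset 22; the adjusted chunk for chunk 0 simply starts at offset 0 of the file.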
-- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com