Re: Speed up COPY FROM text/CSV parsing using SIMD - Mailing list pgsql-hackers
From | Nazir Bilal Yavuz |
---|---|
Subject | Re: Speed up COPY FROM text/CSV parsing using SIMD |
Date | |
Msg-id | CAN55FZ247JdiT8Sd1SRiyOJxk3Ei=pDCL4kpdP=HqLRjOhKf1Q@mail.gmail.com |
In response to | Speed up COPY FROM text/CSV parsing using SIMD (Shinya Kato <shinya11.kato@gmail.com>) |
List | pgsql-hackers |
Hi,

Thank you for working on this!

On Thu, 7 Aug 2025 at 04:49, Shinya Kato <shinya11.kato@gmail.com> wrote:
>
> Hi hackers,
>
> I have implemented SIMD optimization for the COPY FROM (FORMAT {csv,
> text}) command and observed approximately a 5% performance
> improvement. Please see the detailed test results below.

I have been working on the same idea. I was not moving input_buf_ptr as
far as possible, so I think your approach is better.

Also, I did a benchmark on text format. I created a benchmark with line
lengths in a table ranging from 1 byte to 1 megabyte. The improvement
peaks at a line length of 4096 bytes, where it is more than 20% [1]. I
saw no regression with your patch.

> Idea
> ====
> The current text/CSV parser processes input byte-by-byte, checking
> whether each byte is a special character (\n, \r, quote, escape) or a
> regular character, and transitions states in a state machine. This
> sequential processing is inefficient and likely causes frequent branch
> mispredictions due to the many if statements.
>
> I thought this problem could be addressed by leveraging SIMD and
> vectorized operations for faster processing.
>
> Implementation Overview
> =======================
> 1. Create a vector of special characters (e.g., Vector8 nl =
> vector8_broadcast('\n');).
> 2. Load the input buffer into a Vector8 variable called chunk.
> 3. Perform vectorized operations between chunk and the special
> character vectors to check if the buffer contains any special
> characters.
> 4-1. If no special characters are found, advance the input_buf_ptr by
> sizeof(Vector8).
> 4-2. If special characters are found, advance the input_buf_ptr as far
> as possible, then fall back to the original text/CSV parser for
> byte-by-byte processing.
>
...
> Thought?
> I would appreciate feedback on the implementation and any suggestions
> for further improvement.
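To make the quoted steps concrete, here is a minimal, portable sketch of the chunk-skipping loop. It is not the patch itself: CHUNK_SIZE stands in for sizeof(Vector8), and is_special() and skip_plain_bytes() are hypothetical names. The inner loop is plain C that only emulates the vector comparison; in the tree this would presumably be done with the Vector8 helpers from src/include/port/simd.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for sizeof(Vector8); 16 bytes is an assumption. */
#define CHUNK_SIZE 16

/* True if c is a byte the byte-by-byte parser has to look at. */
static bool
is_special(char c, char quote, char escape)
{
	return c == '\n' || c == '\r' || c == quote || c == escape;
}

/*
 * Hypothetical helper: starting at "start", skip over bytes that need no
 * special handling and return the new offset.  The result is either the
 * offset of the first special byte found in a full chunk, or the start of
 * a tail shorter than CHUNK_SIZE that must be parsed byte by byte.
 */
static size_t
skip_plain_bytes(const char *buf, size_t len, size_t start,
				 char quote, char escape)
{
	size_t		pos = start;

	assert(start <= len);

	while (len - pos >= CHUNK_SIZE)
	{
		bool		found = false;
		size_t		i;

		/* Steps 2-3: test the whole chunk without an early exit. */
		for (i = 0; i < CHUNK_SIZE; i++)
			found |= is_special(buf[pos + i], quote, escape);

		if (!found)
		{
			/* Step 4-1: nothing special, skip the whole chunk. */
			pos += CHUNK_SIZE;
			continue;
		}

		/* Step 4-2: advance as far as possible, then hand over. */
		for (i = 0; !is_special(buf[pos + i], quote, escape); i++)
			;
		return pos + i;
	}

	return pos;
}
```

The caller would continue with the existing byte-by-byte parser from the returned offset, so a short tail or a chunk containing a special byte costs nothing in correctness.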

I have a couple of ideas that I was working on:
---

+ * However, SIMD optimization cannot be applied in the following cases:
+ * - Inside quoted fields, where escape sequences and closing quotes
+ * require sequential processing to handle correctly.

I think you can continue SIMD inside quoted fields. The only important
thing is that you need to set last_was_esc to false when SIMD skips a
chunk.
---

+ * - When the remaining buffer size is smaller than the size of a SIMD
+ * vector register, as SIMD operations require processing data in
+ * fixed-size chunks.

You run SIMD when 'copy_buf_len - input_buf_ptr >= sizeof(Vector8)' but
you only call CopyLoadInputBuf() when 'input_buf_ptr >= copy_buf_len ||
need_data', so basically you need to wait for at least sizeof(Vector8)
characters to pass before the next SIMD run. In the worst case, if
CopyLoadInputBuf() loads one character fewer than sizeof(Vector8), then
you can never run SIMD. I think we need to make sure that
CopyLoadInputBuf() loads at least sizeof(Vector8) characters into
input_buf so that we do not encounter that problem.
---

What do you think about adding SIMD to the CopyReadAttributesText() and
CopyReadAttributesCSV() functions? When I add your SIMD approach to the
CopyReadAttributesText() function, the improvement on the 4096 byte line
length input [1] goes from 20% to 30%.
---

I shared my ideas as a Feedback.txt file (.txt to stay off CFBot's radar
for this thread). I hope these help; please let me know if you have any
questions.

--
Regards,
Nazir Bilal Yavuz
Microsoft
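
As a rough illustration of the quoted-field point, the sketch below tracks only the in_quote and last_was_esc flags and compares nothing else. It assumes the escape character differs from the quote character, and CsvScanState, scan_byte(), scan_chunked() and CHUNK_SIZE are hypothetical names, not code from the patch. A chunk with no quote or escape byte is skipped whole, and last_was_esc is cleared because the scalar path would have cleared it on the first non-escape byte anyway.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for sizeof(Vector8); 16 bytes is an assumption. */
#define CHUNK_SIZE 16

/* Hypothetical container for the two CSV flags discussed above. */
typedef struct CsvScanState
{
	bool		in_quote;
	bool		last_was_esc;
} CsvScanState;

/* Scalar per-byte update of the quote state (escape != quote assumed). */
static void
scan_byte(CsvScanState *st, char c, char quote, char escape)
{
	if (st->in_quote && c == escape)
		st->last_was_esc = !st->last_was_esc;
	if (c == quote && !st->last_was_esc)
		st->in_quote = !st->in_quote;
	if (c != escape)
		st->last_was_esc = false;
}

/* Same result as calling scan_byte() on every byte, but chunk-wise. */
static void
scan_chunked(CsvScanState *st, const char *buf, size_t len,
			 char quote, char escape)
{
	size_t		pos = 0;

	while (pos < len)
	{
		if (len - pos >= CHUNK_SIZE)
		{
			bool		found = false;

			/* Emulates the vector compare against quote and escape. */
			for (size_t i = 0; i < CHUNK_SIZE; i++)
				found |= (buf[pos + i] == quote) | (buf[pos + i] == escape);

			if (!found)
			{
				/* Skipped bytes are all non-escape bytes: clear the flag. */
				st->last_was_esc = false;
				pos += CHUNK_SIZE;
				continue;
			}
		}

		/* Chunk has a quote/escape byte, or only a short tail is left. */
		scan_byte(st, buf[pos], quote, escape);
		pos++;
	}
}
```

Without the reset, an escape byte directly before a skipped chunk would leave last_was_esc set, and a closing quote right after the chunk would be misread as escaped.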
Attachment