Re: Parallel Seq Scan vs kernel read ahead - Mailing list pgsql-hackers
From: David Rowley
Subject: Re: Parallel Seq Scan vs kernel read ahead
Date:
Msg-id: CAApHDvoDzAzXEp+Ay2CfT3U=ZcD5NLD7K9_Y936bnHjzs5jkHw@mail.gmail.com
In response to: RE: Parallel Seq Scan vs kernel read ahead ("k.jamison@fujitsu.com" <k.jamison@fujitsu.com>)
Responses: Re: Parallel Seq Scan vs kernel read ahead
List: pgsql-hackers
Hi Kirk,

Thank you for doing some testing on this. It's very useful to get some samples from other hardware / filesystem / OS combinations.

On Tue, 21 Jul 2020 at 21:38, k.jamison@fujitsu.com <k.jamison@fujitsu.com> wrote:
> Query Planner I/O Timings (ms):
>
> | Worker | I/O READ (Master) | I/O READ (Patch) | I/O WRITE (Master) | I/O WRITE (Patch) |
> |--------|-------------------|------------------|--------------------|-------------------|
> | 0      | 1,130.78          | 1,250.82         | 1,698.05           | 1,733.44          |
>
> | Worker | Buffers                                          |
> |--------|--------------------------------------------------|
> | 0      | shared read=442478 dirtied=442478 written=442446 |

I'm thinking the scale of this test might be a bit too small for the machine you're testing on. When you see "shared read" in the EXPLAIN (ANALYZE, BUFFERS) output, it does not necessarily mean that the page had to be read from disk. We use buffered I/O, so the page could just have been fetched from the kernel's cache.

If we do some maths on the timing: it took 1,130.78 milliseconds to read 442478 pages, which, assuming the standard page size of 8192 bytes, is 3457 MB in 1130.78 milliseconds, or about 3057 MB/sec. Is that a realistic I/O throughput for this machine, or do you think some of these pages might be coming from the kernel's cache?

I understand that Amit wrote:

On Fri, 17 Jul 2020 at 21:18, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think recreating the database and restarting the server after each
> run might help in getting consistent results. Also, you might want to
> take median of three runs.

Please also remember that if you're recreating the database after having restarted the machine, creating the table will likely end up caching some of its pages, either in shared buffers or in the kernel's cache. It would be better to leave the database intact and just reboot the machine. I didn't really like that option for my tests, so I instead increased the size of the table beyond anything my machines could have cached: with the 16GB RAM Windows laptop I used a 100GB table, and with the 64GB workstation I used an 800GB table.

I think my test using SELECT * FROM t WHERE a < 0; on a table with a padding column is likely to be a more accurate test. Provided there are no rows with a < 0 in the table, the executor will spend almost all of its time in nodeSeqscan.c trying to find a row with a < 0, and there's no additional aggregation overhead from doing a count(*). Having the additional padding column means we read more data per evaluation of the a < 0 expression. Also, a single-column table is not that realistic.

I'm pretty keen to see this machine run something closer to the test I mentioned in [1], but with the benchmark query from [2] and the "t" table at least twice the size of the machine's RAM. Larger would be better, though. With such a scaled test, I don't think there's much need to reboot the machine in between; just run a single query first to warm up the cache before timing anything. Having the table a few times larger than RAM means we can be certain the disk was actually used during the test, and the more data we can be certain came from disk, the more we can trust that the results are meaningful.

Thanks again for testing this.

David

[1] https://www.postgresql.org/message-id/CAApHDvrfJfYH51_WY-iQqPw8yGR4fDoTxAQKqn+Sa7NTKEVWtg@mail.gmail.com
[2] https://www.postgresql.org/message-id/CAApHDvo+LEGKMcavOiPYK8NEbgP-LrXns2TJ1n_XNRJVE9X+Cw@mail.gmail.com
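Spelling out the arithmetic behind the throughput estimate above, using the numbers from the quoted test results:

    442478 pages * 8192 bytes/page = 3,624,779,776 bytes ≈ 3457 MB
    3457 MB / 1.13078 sec ≈ 3057 MB/sec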
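A minimal sketch of the kind of setup described above, in SQL. The table definition, padding width, and row count here are illustrative assumptions, not the exact test from [1] and [2]:

    -- Table with an int key and a padding column; size it to at least
    -- twice the machine's RAM so reads must come from disk.
    CREATE TABLE t (a int NOT NULL, pad text);
    INSERT INTO t
    SELECT x, repeat('x', 100)
    FROM generate_series(1, 1000000000) x;

    -- Check the on-disk size against the machine's RAM.
    SELECT pg_size_pretty(pg_relation_size('t'));

    -- Warm the cache with one run, then time the scan.  No rows match
    -- a < 0, so the executor spends its time in the seq scan rather
    -- than returning tuples, with no count(*) aggregation overhead.
    EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM t WHERE a < 0;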