Home > mailing lists

Re: Query Performance / Planner estimate off - Mailing list pgsql-performance

From	Sebastian Dressler
Subject	Re: Query Performance / Planner estimate off
Date	October 21, 2020 15:35:07
Msg-id	4CF9B275-8B82-48B5-94AD-BDB6FA5BF3C2@swarm64.com Whole thread Raw
In response to	Re: Query Performance / Planner estimate off (Mats Olsen <mats@duneanalytics.com>)
Responses	Re: Query Performance / Planner estimate off
List	pgsql-performance

Tree view

Hi Mats,

Happy to help.

On 21. Oct 2020, at 16:42, Mats Olsen <mats@duneanalytics.com> wrote:
On 10/21/20 2:38 PM, Sebastian Dressler wrote:
Hi Mats,

On 20. Oct 2020, at 11:37, Mats Julian Olsen <mats@duneanalytics.com> wrote:

[...]

1) Vanilla plan (16 min) : https://explain.depesz.com/s/NvDR
2) enable_nestloop=off (4 min): https://explain.depesz.com/s/buKK
3) enable_nestloop=off; enable_seqscan=off (2 min): https://explain.depesz.com/s/0WXx

How can I get Postgres not to loop over 12M rows?

I looked at the plans and your config and there are some thoughts I'm having:

- The row estimate is off, as you possibly noticed. This can be possibly solved by raising `default_statistics_target` to e.g. 2500 (we typically use that) and run ANALYZE
I've `set default_statistics_target=2500` and ran analyze on both tables involved, unfortunately the plan is the same. The columns we use for joining here are hashes and we expect very few duplicates in the tables. Hence I think extended statistics (storing most common values and histogram bounds) aren't useful for this kind of data. Would you say the same thing?

Yes, that looks like a given in this case.

- I however think that the misestimate might be caused by the evt_tx_hash being of type bytea. I believe that PG cannot estimate this very well for JOINs and will rather pick row numbers too low. Hence the nested loop is picked and there might be no way around this. I have experienced similar things when applying JOINs on VARCHAR with e.g. more than 3 fields for comparison.
This is very interesting, and I have never heard of issues with using `bytea` for joins. Our entire database is filled with them, as we deal with hashes of different lengths. In fact I would estimate that 60% of columns are bytea's. My intuition would say that it's better to store the hashes as byte arrays, rather than `text` fields as you can compare the raw bytes directly without encoding first? Do you have any references for this?

Unfortunately, I have not dealt yet with `bytea` that much. It just rang a bell when I saw these kind of off-estimates in combination with nested loops. In the case I referenced it was, that the tables had 3 VARCHAR columns to be joined on and the estimate was very much off. As a result, PG chose nested loops in the upper layers of processing. Due to another JOIN the estimate went down to 1 row whereas it was 1 million rows in reality. Now, yours is "only" a factor 5 away, i.e. this might be a totally different reason.

However, I looked into the plan once more and realized, that the source of the problem could also be the scan on "Pair_evt_Mint" along the date dimension. Although you have a stats target of 10k there. If the timestamp is (roughly) sorted, you could try adding a BRIN index and by that maybe get a better estimate & scan-time.

Alternatively, since I know the length of the hashes in advance, I could've used `varchar(n)`, but I don't think there's any gains to be had in postgres by doing that? Something like `bytea(n)` would also have been interesting, had postgres been able to exploit that information.

I think giving VARCHAR a shot makes sense, maybe on an experimental basis to see whether the estimates get better. Maybe PG can then estimate that there are (almost) no dupes within the table but that there are N-many across tables. Another option to explore is maybe to use UUID as a type. As said above, it more looks like the timestamp causing the mis-estimate.

Maybe try querying this table by itself with that timestamp to see what kind of estimate you get?

- Other things to look into:

- work_mem seems too low to me with 56MB, consider raising this to the GB range to avoid disk-based operations
- min_parallel_table_scan_size - try 0
- parallel_setup_cost (default 1000, maybe try 500)
- parallel_tuple_cost (default 1.0, maybe try 0.1)
- random_page_cost (as mentioned consider raising this maybe much higher, factor 10 or sth like this) or (typically) seq_page_cost can be possibly much lower (0.1, 0.01) depending on your storage
I've tried various settings of these parameters now, and unfortunately the only parameter that alters the query plan is the last one (random_page_cost), which also has the side effect of (almost) forcing sequential scans for most queries as far as I understand? Our storage is Google Cloud pd-ssd.

I think a combination of random_page_cost with parallel_tuple_cost and min_parallel_table_scan_size might make sense. By that you possibly get at least parallel sequential scans. But I understand that this is possibly having the same effect as using `enable_nestloop = off`.

Thank you so much for you response, I'm looking forward to keep the discussion going.

You're very welcome.

Best,

Sebastian

--

Sebastian Dressler, Solution Architect
+49 30 994 0496 72 | sebastian@swarm64.com

Swarm64 AS
Parkveien 41 B | 0258 Oslo | Norway
Registered at Brønnøysundregistrene in Norway under Org.-Number 911 662 787
CEO/Geschäftsführer (Daglig Leder): Thomas Richter; Chairman/Vorsitzender (Styrets Leder): Dr. Sverre Munck

Swarm64 AS Zweigstelle Hive
Ullsteinstr. 120 | 12109 Berlin | Germany
Registered at Amtsgericht Charlottenburg - HRB 154382 B

Attachment

image001.jpg

pgsql-performance by date:

From: Michael Lewis
Date: 21 October 2020, 15:29:53
Subject: Re: Query Performance / Planner estimate off

From: Nagaraj Raj
Date: 22 October 2020, 00:32:29
Subject: Query performance

Re: Query Performance / Planner estimate off - Mailing list pgsql-performance

Attachment

Previous

Next