
From: Greg Sabino Mullane
Subject: Re: Optimizing count(), but Explain estimates wildly off
Date:
Msg-id: CAKAnmmJSstuCamBjd5N8o9Y-EzZHawZFC6ofvuA0nHAcw1pWWQ@mail.gmail.com
In response to: Re: Optimizing count(), but Explain estimates wildly off (Chema <chema@interneta.org>)
List: pgsql-performance
> columns has not improved the planner's estimates, which are off by almost 
> 1M, and there's been no suggestion of what could cause that.

You are asking a lot of the planner - how would it know that the average number of items is much higher for ids derived indirectly from "Mexico" versus ids derived from "Colombia"?

One thing you could try, just as a general performance gain, is an index-only scan, by creating an index like this:

create index tenders_date_country_id on tenders (country, "date") include (transaction_id);
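
For that to actually produce index-only scans, the table's visibility map needs to be reasonably current, so vacuum first. A quick way to verify (the filter values here are just placeholders, not from your data):

vacuum analyze tenders;
explain (analyze, buffers)
select transaction_id from tenders where country = 'Mexico' and "date" >= '2024-01-01';

If the plan shows an Index Only Scan with "Heap Fetches: 0" (or close to it), the index is paying off.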

>>  Parallel Seq Scan on pricescope_items  (cost=0.00..1027794.01 rows=3277101 width=522) 
>> (actual time=0.753..41654.507 rows=2621681 loops=3)
>
> Why does it take over 41 seconds to read a table with less than 3 million rows?

Good question. I still maintain it's because you are doing a 'select star' on large, toasted rows.
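
You can check how much of that table lives outside the main heap (TOAST, plus the free-space and visibility maps) with the standard size functions, using the table name from your plan:

select pg_size_pretty(pg_relation_size('pricescope_items')) as main_heap,
       pg_size_pretty(pg_table_size('pricescope_items') - pg_relation_size('pricescope_items')) as toast_and_maps;

If the second number is large relative to the first, every "select star" row fetch is dragging detoasted data along with it.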

I made two tables with approximately the same number of rows, and ran the query. It returned a hash join containing:
 
->  Parallel Seq Scan on items  (cost=0.00..69602.93 rows=3375592 width=8)
     (actual time=0.015..185.414 rows=2700407 loops=3)

Then I boosted the width a lot by adding some text columns filled with data, and it returned the same number of rows, but much more slowly:

->  Parallel Seq Scan on items  (cost=0.00..1729664.15 rows=3370715 width=1562)
     (actual time=0.027..36693.986 rows=2700407 loops=3)

A second run with everything in cache was better, but still an order of magnitude worse than the small-row case:

->  Parallel Seq Scan on items  (cost=0.00..1729664.15 rows=3370715 width=1562)
     (actual time=0.063..1565.486 rows=2700407 loops=3)
 
Best of all was a "SELECT 1", which switched the entire plan to a much faster merge join, resulting in:

-> Parallel Index Only Scan using items_tender_transaction_id_index on items  (cost=0.43..101367.60 rows=3372717 width=4)
     (actual time=0.087..244.878 rows=2700407 loops=3)
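
In other words, count over a derived table that projects only a constant instead of whole rows. A rough sketch of what that might look like for you, with the join and filter columns guessed from earlier in the thread:

select count(*)
from (
    select 1
    from tenders t
    join pricescope_items i using (transaction_id)
    where t.country = 'Mexico'
) sub;

Since no actual column values are needed, the planner is free to use a narrow index rather than dragging the wide rows through the join.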

Yours will be different, as I cannot exactly duplicate your schema or data distribution, but give "SELECT 1" a try. This was on Postgres 16, FWIW, with a default_statistics_target of 100.

Cheers,
Greg
