Thread: Optimizing count(), but Explain estimates wildly off
Dear pgsqlers,

I'm trying to optimize simple queries on two tables (tenders & items) with a couple million records. Besides the resulting records, the app also displays the count of total results. Doing count() takes as much time as the other query (which can be 30+ secs), so it's an obvious target for optimization. I'm already caching count() results for the most common conditions (country & year) in a material table, which practically halves response time. The tables are updated sparingly, and only with bulk COPYs. Now I'm looking for ways to optimize queries with other conditions.

Reading around, seems many people are still using this 2005 snippet to obtain the row count estimate from Explain:

CREATE FUNCTION count_estimate(query text) RETURNS integer AS $$
DECLARE
    rec record;
    rows integer;
BEGIN
    FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
        rows := substring(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
        EXIT WHEN rows IS NOT NULL;
    END LOOP;
    RETURN rows;
END;
$$ LANGUAGE plpgsql VOLATILE STRICT;
Is this still the current best practice? Any tips to increase precision? Currently it can over- or underestimate the actual number of rows by about a million, as seen in the sample queries (1,955,297 instead of 1,001,200; 162,080 instead of 1,292,010).

Any other tips to improve the query are welcome, of course. There's a big disparity between the two sample query plans even though only the filtered country changes.

I already raised default_statistics_target up to 2k (the planner wasn't using indexes at all with low values). Gotta get it even higher? These are my custom settings:

shared_buffers = 256MB              # min 128kB
work_mem = 128MB                    # min 64kB
maintenance_work_mem = 254MB        # min 1MB
effective_cache_size = 2GB
default_statistics_target = 2000
random_page_cost = 1.0              # same scale as above

Sample query:
Explain Analyze
Select * from tenders inner join items on transaction_id = tender_transaction_id
where country = 'Colombia'
and "date" >= '2023-01-01' and "date" < '2024-01-01'QUERY PLAN Gather (cost=253837.99..1506524.32 rows=1955297 width=823) (actual time=51433.592..63239.809 rows=1001200 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Hash Join (cost=252837.99..1309994.62 rows=814707 width=823) (actual time=51361.920..61729.142 rows=333733 loops=3)
Hash Cond: (items.tender_transaction_id = tenders.transaction_id)
-> Parallel Seq Scan on items (cost=0.00..1048540.46 rows=3282346 width=522) (actual time=1.689..56887.108 rows=2621681 loops=3)
-> Parallel Hash (cost=247919.56..247919.56 rows=393475 width=301) (actual time=2137.473..2137.476 rows=333733 loops=3)
Buckets: 1048576 Batches: 1 Memory Usage: 219936kB
-> Parallel Bitmap Heap Scan on tenders (cost=16925.75..247919.56 rows=393475 width=301) (actual time=385.315..908.865 rows=333733 loops=3)
Recheck Cond: ((country = 'Colombia'::text) AND (date >= '2023-01-01'::date) AND (date < '2024-01-01'::date))
Heap Blocks: exact=24350
-> Bitmap Index Scan on tenders_country_and_date_index (cost=0.00..16689.67 rows=944339 width=0) (actual time=423.213..423.214 rows=1001200 loops=1)
Index Cond: ((country = 'Colombia'::text) AND (date >= '2023-01-01'::date) AND (date < '2024-01-01'::date))
Planning Time: 12.784 ms
JIT:
Functions: 33
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 14.675 ms, Inlining 383.349 ms, Optimization 1023.521 ms, Emission 651.442 ms, Total 2072.987 ms
Execution Time: 63378.033 ms
Explain Analyze
Select * from tenders inner join items on transaction_id = tender_transaction_id
where country = 'Mexico'
and "date" >= '2023-01-01' and "date" < '2024-01-01'QUERY PLAN Gather (cost=1000.99..414258.70 rows=162080 width=823) (actual time=52.538..7006.128 rows=1292010 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Nested Loop (cost=0.99..397050.70 rows=67533 width=823) (actual time=40.211..4087.081 rows=430670 loops=3)
-> Parallel Index Scan using tenders_country_and_date_index on tenders (cost=0.43..45299.83 rows=32616 width=301) (actual time=4.376..59.760 rows=1218 loops=3)
Index Cond: ((country = 'Mexico'::text) AND (date >= '2023-01-01'::date) AND (date < '2024-01-01'::date))
-> Index Scan using items_tender_transaction_id_index on items (cost=0.56..10.67 rows=11 width=522) (actual time=0.321..3.035 rows=353 loops=3655)
Index Cond: (tender_transaction_id = tenders.transaction_id)
Planning Time: 7.808 ms
JIT:
Functions: 27
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 17.785 ms, Inlining 0.000 ms, Optimization 5.080 ms, Emission 93.274 ms, Total 116.138 ms
Execution Time: 7239.427 ms

Thanks in advance!
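(For reference, the count cache described above could be kept as something like the following — a sketch only; the view name, the grouping, and the use of a materialized view rather than a plain table are assumptions, not the poster's actual setup:)

-- Hypothetical cache of result counts for the common country/year filters
CREATE MATERIALIZED VIEW result_counts AS
SELECT t.country,
       extract(year FROM t."date") AS year,
       count(*) AS total
FROM tenders t
JOIN items i ON i.tender_transaction_id = t.transaction_id
GROUP BY 1, 2;

-- Refresh after each bulk COPY, since the tables only change then
REFRESH MATERIALIZED VIEW result_counts;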
On Mon, 2024-02-26 at 18:25 -0600, Chema wrote:
> I'm trying to optimize simple queries on two tables (tenders & items) with a couple
> million records. Besides the resulting records, the app also displays the count of
> total results. Doing count() takes as much time as the other query (which can be
> 30+ secs), so it's an obvious target for optimization.
>
> Reading around, seems many people are still using this 2005 snippet to obtain the
> row count estimate from Explain:

I recommend using FORMAT JSON and extracting the top row count from that.
It is simpler and less error-prone.

> Is this still the current best practice? Any tips to increase precision?
> Currently it can estimate the actual number of rows for over or under a million,
> as seen on the sample queries (1,955,297 instead of 1,001,200; 162,080 instead
> of 1,292,010).

Looking at the samples you provided, I get the impression that the statistics for
the table are quite outdated. That will affect the estimates. Try running ANALYZE
and see if that improves the estimates.

Yours,
Laurenz Albe
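(A minimal sketch of that FORMAT JSON variant — not Laurenz's exact code; the jsonb variable and the bigint return type are choices made here:)

CREATE OR REPLACE FUNCTION count_estimate(query text) RETURNS bigint AS $$
DECLARE
    plan jsonb;
BEGIN
    -- EXPLAIN (FORMAT JSON) returns an array with a single element whose
    -- "Plan" object carries the planner's estimate as "Plan Rows".
    EXECUTE 'EXPLAIN (FORMAT JSON) ' || query INTO plan;
    RETURN (plan -> 0 -> 'Plan' ->> 'Plan Rows')::bigint;
END;
$$ LANGUAGE plpgsql STRICT;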
Hi Chema,

On 2024-Feb-26, Chema wrote:
> Dear pgsqlers,
>
> I'm trying to optimize simple queries on two tables (tenders & items) with
> a couple million records. Besides the resulting records, the app also
> displays the count of total results. Doing count() takes as much time as
> the other query (which can be 30+ secs), so it's an obvious target for
> optimization. I'm already caching count() results for the most common
> conditions (country & year) in a material table, which practically halves
> response time. The tables are updated sparingly, and only with bulk
> COPYs. Now I'm looking for ways to optimize queries with other conditions.

It sounds like this approach might serve your purposes:
https://www.postgresql.eu/events/pgconfeu2023/schedule/session/4762-counting-things-at-the-speed-of-light-with-roaring-bitmaps/

> I already raised default_statistics_target up to 2k (the planner wasn't
> using indexes at all with low values). Gotta get it even higher? These are
> my custom settings:

I would recommend to put the default_statistics_target back to its original
value and modify the value with ALTER TABLE .. SET STATISTICS only for columns
that need it, only on tables that need it; then ANALYZE everything. The planner
gets too slow if you have too many stats for everything.

> shared_buffers = 256MB # min 128kB

This sounds far too low, unless your server is a Raspberry Pi or something.
See "explain (buffers, analyze)" of your queries to see how much buffer traffic
is happening for them.

> Functions: 33
> Options: Inlining true, Optimization true, Expressions true, Deforming true
> Timing: Generation 14.675 ms, Inlining 383.349 ms, Optimization 1023.521
> ms, Emission 651.442 ms, Total 2072.987 ms
> Execution Time: 63378.033 ms

Also maybe experiment with turning JIT off. Sometimes it brings no benefit and
slows down execution pointlessly. Here you spent two seconds JIT-compiling the
query; were they worth it?

Cheers

--
Álvaro Herrera               Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
Syntax error: function hell() needs an argument.
Please choose what hell you want to involve.
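(A sketch of those two suggestions in SQL — the statistics target of 1000 and the choice of columns are only examples, taken from the filters in the sample queries:)

ALTER TABLE tenders ALTER COLUMN country SET STATISTICS 1000;
ALTER TABLE tenders ALTER COLUMN "date" SET STATISTICS 1000;
ANALYZE tenders;

-- and, per session, to see whether JIT is paying for itself:
SET jit = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM tenders INNER JOIN items ON transaction_id = tender_transaction_id
WHERE country = 'Colombia' AND "date" >= '2023-01-01' AND "date" < '2024-01-01';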
effective_cache_size = 6GB # 0.5-0.75 RAM (free -h: free + cache + shared_buffers)
work_mem = 128MB # RAM * 0.25 / max_connections.
maintenance_work_mem = 512MB
default_statistics_target = 500 # def 100, higher to make planner use indexes in big warehouse tables.
random_page_cost = 1.1 # Random reads in SSD cost almost as little as sequential ones
"items": scanned 995023 of 995023 pages, containing 7865043 live rows and 0 dead rows; 1500000 rows in sample, 7865043 estimated total rows
but same deal:
-- After config pimp 1,959,657 instead of 1,001,200 45,341.654 ms
Gather (cost=247031.70..1479393.82 rows=1959657 width=824) (actual time=8464.691..45257.435 rows=1001200 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Hash Join (cost=246031.70..1282428.12 rows=816524 width=824) (actual time=8413.057..44614.153 rows=333733 loops=3)
Hash Cond: (pricescope_items.tender_transaction_id = pricescope_tenders.transaction_id)
-> Parallel Seq Scan on pricescope_items (cost=0.00..1027794.01 rows=3277101 width=522) (actual time=0.753..41654.507 rows=2621681 loops=3)
-> Parallel Hash (cost=241080.20..241080.20 rows=396120 width=302) (actual time=995.247..995.250 rows=333733 loops=3)
Buckets: 1048576 Batches: 1 Memory Usage: 219904kB
-> Parallel Bitmap Heap Scan on pricescope_tenders (cost=17516.10..241080.20 rows=396120 width=302) (actual time=162.898..321.472 rows=333733 loops=3)
Recheck Cond: ((country = 'Colombia'::text) AND (date >= '2023-01-01'::date) AND (date < '2024-01-01'::date))
Heap Blocks: exact=34722
-> Bitmap Index Scan on pricescope_tenders_country_and_date_index (cost=0.00..17278.43 rows=950688 width=0) (actual time=186.536..186.537 rows=1001200 loops=1)
Index Cond: ((country = 'Colombia'::text) AND (date >= '2023-01-01'::date) AND (date < '2024-01-01'::date))
Planning Time: 11.310 ms
JIT:
Functions: 33
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 8.608 ms, Inlining 213.375 ms, Optimization 557.351 ms, Emission 417.568 ms, Total 1196.902 ms
Execution Time: 45341.654 ms
-- Subselect country to hide constant from planner, so it doesn't use statistics
Explain Analyze
Select * from pricescope_tenders inner join pricescope_items on transaction_id = tender_transaction_id
where country = (select 'Colombia')
and "date" >= '2023-01-01' and "date" < '2024-01-01'
;
-- Colombia in subselect 428,623 instead of 1,001,200 6674.860 ms
Gather (cost=1001.00..570980.73 rows=428623 width=824) (actual time=166.785..6600.673 rows=1001200 loops=1)
Workers Planned: 2
Params Evaluated: $0
Workers Launched: 2
InitPlan 1 (returns $0)
-> Result (cost=0.00..0.01 rows=1 width=32) (actual time=166.031..166.033 rows=1 loops=1)
-> Nested Loop (cost=0.99..527118.42 rows=178593 width=824) (actual time=200.511..5921.585 rows=333733 loops=3)
-> Parallel Index Scan using pricescope_tenders_country_and_date_index on pricescope_tenders (cost=0.43..104391.64 rows=86641 width=302) (actual time=200.388..400.882 rows=333733 loops=3)
Index Cond: ((country = $0) AND (date >= '2023-01-01'::date) AND (date < '2024-01-01'::date))
-> Index Scan using pricescope_items_tender_transaction_id_index on pricescope_items (cost=0.56..4.83 rows=5 width=522) (actual time=0.016..0.016 rows=1 loops=1001200)
Index Cond: (tender_transaction_id = pricescope_tenders.transaction_id)
Planning Time: 7.372 ms
JIT:
Functions: 31
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 6.981 ms, Inlining 209.470 ms, Optimization 308.123 ms, Emission 248.176 ms, Total 772.750 ms
Execution Time: 6674.860 ms
On Thu, 2024-02-29 at 17:15 -0600, Chema wrote:
> No major changes after doing Analyze, and also Vacuum Analyze.

Indeed.

This caught my attention:

> -> Parallel Seq Scan on pricescope_items (cost=0.00..1027794.01 rows=3277101 width=522) (actual time=0.753..41654.507 rows=2621681 loops=3)

Why does it take over 41 seconds to read a table with less than
3 million rows? Are the rows so large? Is the table bloated?

What is the size of the table as measured with pg_relation_size()
and pg_table_size()?

Yours,
Laurenz Albe
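(The numbers in the reply below can be gathered with a query along these lines — a sketch only; adjust the relation names to the actual pricescope_* tables as needed:)

SELECT c.relname               AS name,
       pg_table_size(c.oid)    AS table_size,
       pg_relation_size(c.oid) AS relation_size,
       pg_indexes_size(c.oid)  AS index_size,
       c.reltuples::bigint     AS row_estimate
FROM pg_class c
WHERE c.relname IN ('tenders', 'items');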
-- pg_table_size, pg_relation_size, pg_indexes_size, rows
  name   |  table_size   | relation_size | index_size | row_estimate
---------+---------------+---------------+------------+--------------
 tenders | 1,775,222,784 | 1,630,461,952 |            |    3,815,567
 items   | 8,158,773,248 | 6,052,470,784 |            |    7,865,043
-- pg_table_size, pg_relation_size, pg_indexes_size, rows
  name   |  table_size   | relation_size |  index_size   | row_estimate
---------+---------------+---------------+---------------+--------------
 tenders | 1,203,445,760 | 1,203,421,184 |   500,482,048 |    3,815,567
 items   | 4,436,189,184 | 4,430,790,656 | 2,326,118,400 |    7,865,043
There's one JSON column in each table with a couple fields, and a column with long texts in Items.
On Mon, Mar 4, 2024 at 2:14 PM Chema <chema@interneta.org> wrote:
> There's one JSON column in each table with a couple fields, and a column with
> long texts in Items.

and earlier indicated the query was:

> Select * from tenders inner join items

You do not want to do a "select star" on both tables unless you 100% need every
single column and plan to actively do something with it. Especially true for
large text and json columns. Also, use jsonb not json.
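(A sketch of both points — the column names picked here are purely illustrative, since the real ones aren't shown in the thread:)

-- Fetch only the columns the app actually renders, instead of SELECT *
SELECT t.transaction_id, t.country, t."date", i.description   -- hypothetical column list
FROM tenders t
INNER JOIN items i ON i.tender_transaction_id = t.transaction_id
WHERE t.country = 'Colombia'
  AND t."date" >= '2023-01-01' AND t."date" < '2024-01-01';

-- Switch a json column to jsonb ("details" is a hypothetical column name)
ALTER TABLE items ALTER COLUMN details TYPE jsonb USING details::jsonb;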
Depending on the complexity of your query, this number may become less and less accurate. In fact, in my application, as we added joins and complex conditions, it became so inaccurate it was completely worthless, even to know within a power of 100 how many rows we'd have returned, so we had to abandon that strategy.
But if your query is simple enough that Pg can predict within some reasonable margin of error how many rows it will return, it may work for you.
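(For instance, with the count_estimate() function from earlier in the thread, run against the sample query:)

SELECT count_estimate($q$
    SELECT * FROM tenders
    INNER JOIN items ON transaction_id = tender_transaction_id
    WHERE country = 'Colombia'
      AND "date" >= '2023-01-01' AND "date" < '2024-01-01'
$q$);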
Yours will be different, as I cannot exactly duplicate your schema or data distribution, but give "SELECT 1" a try. This was on Postgres 16, FWIW, with a default_statistics_target of 100.
query x 100 | avg | min | q1 | median | q3 | max
------------------------+--------------------+-------------------+-------------------+--------------------+-------------------+--------------------
Count Colombia | 9093.918731212616 | 6334.060907363892 | 7366.191983222961 | 9154.448866844177 | 10276.342272758484 | 13520.153999328613
Subquery Colombia | 7926.021897792816 | 5926.224946975708 | 7000.077307224274 | 7531.211018562317 | 8828.327298164368 | 11380.73992729187
Sel* Colombia | 8694.387829303741 | 6963.425874710083 | 8149.151265621185 | 8704.618453979492 | 9153.236508369446 | 11787.146806716919
Sel* Subquery Colombia | 8622.495520114899 | 6959.257125854492 | 8179.068505764008 | 8765.061974525452 | 9159.55775976181 | 10187.61420249939
Sel1 Colombia | 22717.704384326935 | 8666.495084762573 | 22885.42276620865 | 23949.790477752686 | 24966.21882915497 | 30625.644207000732
-- After including transaction_id in country,date index
query x 20 | avg | min | q1 | median | q3 | max
------------------------+--------------------+--------------------+-------------------+--------------------+--------------------+--------------------
Count Colombia | 10326.94479227066 | 7079.586982727051 | 8091.441631317139 | 10685.971021652222 | 11660.240888595581 | 16219.580888748169
Subquery Colombia | 8345.360279083252 | 6759.0179443359375 | 7150.483548641205 | 7609.055519104004 | 8118.529975414276 | 15819.210052490234
Sel* Colombia | 9401.683914661407 | 8350.785970687866 | 8727.016389369965 | 9171.823978424072 | 9705.730974674225 | 12684.055089950562
Sel* Subquery Colombia | 10874.297595024109 | 7996.103048324585 | 9317.362785339355 | 10767.66049861908 | 12130.92851638794 | 14003.422021865845
Sel1 Colombia | 14704.787838459015 | 7033.560991287231 | 8938.009798526764 | 11308.07101726532 | 21711.08090877533 | 25156.877994537354
query x 100 | avg | min | q1 | median | q3 | max
------------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------
Count Colombia | 8165.0702357292175 | 5923.334121704102 | 6800.160050392151 | 7435.7980489730835 | 9075.710475444794 | 13613.409042358398
Subquery Colombia | 7299.517266750336 | 5389.672040939331 | 6362.253367900848 | 6781.42237663269 | 7978.189289569855 | 11542.781829833984
Sel* Colombia | 14157.406282424927 | 8775.223016738892 | 13062.03180551529 | 14233.824968338013 | 15513.144373893738 | 19184.97586250305
Sel* Subquery Colombia | 13438.675961494446 | 10216.159105300903 | 12183.876752853394 | 13196.363925933838 | 14356.310486793518 | 20111.860036849976
Sel1 Colombia | 13753.776743412018 | 7020.914793014526 | 7893.3587074279785 | 9101.168870925903 | 22971.67855501175 | 26913.809061050415
Sel1 Subquery Colombia | 6757.480027675629 | 5529.844045639038 | 6212.466478347778 | 6777.510046958923 | 7212.876975536346 | 8500.235080718994