Thread: Faster distinct query?
I was wondering if there was any way to improve the performance of this query:
SELECT station,array_agg(distinct(channel)) as channels FROM data GROUP BY station;
The explain execution plan can be found here: https://explain.depesz.com/s/mtxB#html
and it looks pretty straightforward. It does an index-only scan, followed by an aggregate, to produce a result that is a list of stations along with a list of channels associated with each (there can be anywhere from 1 to 3 channels associated with each station). This query takes around 5 minutes to run.
To work around the issue, I created a materialized view that I can update periodically, and of course I can query said view in no time flat. However, I’m concerned that as the dataset grows, the time it takes to refresh the view will also grow (correct me if I am wrong there).
This is running PostgreSQL 13, and the index referenced is a two-column index on data(station, channel)
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145
On Wed, Sep 22, 2021 at 1:05 PM Israel Brewster <ijbrewster@alaska.edu> wrote:
> To work around the issue, I created a materialized view that I can update periodically, and of course I can query said view in no time flat. However, I’m concerned that as the dataset grows, the time it takes to refresh the view will also grow (correct me if I am wrong there).
I'd probably turn that index into a foreign key that just ensures that every (station,channel) that appears in the data table also appears on the lookup table. Grouping and array-ifying the lookup table would be trivial. Either modify the application code or add a trigger to populate the lookup table as needed.
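[A minimal sketch of that arrangement, for reference. The table and constraint names are hypothetical, and the column types are assumed (the thread later confirms station is a smallint and channel a varchar):]

-- Hypothetical lookup table holding each (station, channel) pair exactly once.
CREATE TABLE station_channels (
    station smallint NOT NULL,
    channel varchar NOT NULL,
    PRIMARY KEY (station, channel)
);

-- One-time seeding from the existing data table (slow, but only done once).
INSERT INTO station_channels
SELECT station, channel FROM data GROUP BY station, channel;

-- The foreign key then guarantees every pair appearing in data
-- also appears in the lookup table.
ALTER TABLE data
    ADD FOREIGN KEY (station, channel)
    REFERENCES station_channels (station, channel);

-- Grouping and array-ifying the lookup table is trivial:
SELECT station, array_agg(channel) AS channels
FROM station_channels
GROUP BY station;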
The parentheses around channel in "array_agg(distinct(channel))" are unnecessary - you are invoking composite-type syntax, which is ignored in the single column case unless you write the optional ROW keyword, i.e., distinct ROW(channel)
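[That is, the original query can be spelled more plainly as:]

SELECT station, array_agg(DISTINCT channel) AS channels
FROM data
GROUP BY station;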
David J.
In the future, please share the plan returned by explain analyze, and some data about how many rows in the involved tables, what type of system you are running it on, any changes from default configs, etc.
How does the below work? It should do a very simple index scan only, then aggregate the relatively few rows after the fact.
select station, array_agg(distinct(channel)) as channels
FROM(
SELECT station,channel FROM data GROUP BY station,channel
) AS sub
group by station;
If there is correlation between station & channel, then you might look at creating a multivariate statistics object and analyzing the table so the planner can make better choices, knowing that channel is dependent on station perhaps. I wouldn't necessarily think that it would help this query, but perhaps others. Also, you might try creating only dependencies, only ndistinct type, or some combination other than all 3 types.
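[For reference, a sketch of what creating such a statistics object might look like; the object name is arbitrary:]

CREATE STATISTICS data_station_channel_stats (ndistinct, dependencies)
    ON station, channel FROM data;
-- Re-analyze so the planner picks up the new statistics:
ANALYZE data;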
On Wed, Sep 22, 2021 at 1:21 PM Michael Lewis <mlewis@entrata.com> wrote:
> In the future, please share the plan returned by explain analyze, and some data about how many rows in the involved tables,
I believe we consider it acceptable to link to an explain viewer, which is what the OP did. Reading explain output in email has its own challenges, and I'd rather have the website than a text attachment.
> How does the below work? It should do a very simple index scan only, then aggregate the relatively few rows after the fact.
>
> select station, array_agg(distinct(channel)) as channels
> FROM (SELECT station, channel FROM data GROUP BY station, channel) AS sub
> group by station;
Yeah, am pondering this too, though seems like the queries should be identical so the plan/execution should be the same either way.
> If there is correlation between station & channel, then you might look at creating a multivariate statistics object and analyzing the table so the planner can make better choices
There is no where clause so I'm doubtful there is much to be gained going down this path. The Index-Only scan seems like an optimal way to obtain this data and the existing query already does that. The aggregation path might vary though it seems like that shouldn't be the case here.
David J.
On Sep 22, 2021, at 12:20 PM, David G. Johnston <david.g.johnston@gmail.com> wrote:
> I'd probably turn that index into a foreign key that just ensures that every (station,channel) that appears in the data table also appears on the lookup table. Grouping and array-ifying the lookup table would be trivial. Either modify the application code or add a trigger to populate the lookup table as needed.
Makes sense. I was actually considering this approach (albeit without the foreign key - that’s a nice additional safety measure), but was concerned about the overhead that adding said trigger would have on inserts - thus my thought to try the materialized view. As a reference, this database is receiving 1Hz data from around 170 stations, with up to three channels of data per station. So something like 350-500 inserts per second, although the data is “grouped” into 10 minute batches. I’ll give it another look.
> The parentheses around channel in "array_agg(distinct(channel))" are unnecessary - you are invoking composite-type syntax, which is ignored in the single column case unless you write the optional ROW keyword, i.e., distinct ROW(channel)
Good information, thanks!
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145
"David G. Johnston" <david.g.johnston@gmail.com> writes: > There is no where clause so I'm doubtful there is much to be gained going > down this path. The Index-Only scan seems like an optimal way to obtain > this data and the existing query already does that. The "index-only" scan is reported to do 86m heap fetches along the way to returning 812m rows, so the data is apparently pretty dirty. It's possible that a preliminary VACUUM to get page-all-visible hint bits set would be a net win. regards, tom lane
[Timescale Dev Advocate here]
I realize this might not be the most accepted answer (could be interpreted as trying to "sell" something), but feels like an opportunity to talk about DISTINCT queries and opportunities. Because you have that index, Timescale 2.3 added a "Skip Scan" query planner node that works on regular BTree indexes (it doesn't have to be time-series/TimescaleDB Hypertable data at all). In this case, your distinct query would likely run in a few milliseconds based on the counts you mention (170 stations, 3 channels per station), and then the outer aggregation would do the GROUP BY. So, you **could** add the TimescaleDB extension to your database (or a copy of) and give it a try. You don't actually need to use any TimescaleDB features otherwise.
- A writeup of why this kind of DISTINCT query is slow in PostgreSQL (for now) and what we did to overcome it: https://blog.timescale.com/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/
- Plans for a similar feature in PostgreSQL proper that we'd totally support but hasn't made forward progress yet: https://commitfest.postgresql.org/19/1741/
Anyway, it might be worth a shot. HTH
Ryan B
> On Sep 22, 2021, at 12:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> The "index-only" scan is reported to do 86m heap fetches along the
> way to returning 812m rows, so the data is apparently pretty dirty.
> It's possible that a preliminary VACUUM to get page-all-visible hint
> bits set would be a net win.

I do have autovacuum turned on, but perhaps I need to do a manual one? The initial population of the database was accomplished via logical replication from a different database cluster (needed to move this database to more dedicated hardware), so perhaps that left the database in a state that autovacuum doesn’t address? Or perhaps my autovacuum settings aren’t kosher - I haven’t adjusted that portion of the config any.

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145
On Sep 22, 2021, at 12:49 PM, Ryan Booz <ryan@timescale.com> wrote:
> So, you **could** add the TimescaleDB extension to your database (or a copy of) and give it a try. You don't actually need to use any TimescaleDB features otherwise.
I had actually already done that, as I was considering, in spite of past negative experiences with timescaledb, experimenting with it on this DB to see if it worked any better with this data. Out of curiosity, I tried removing the timescaledb extension, whereupon the query in question took roughly twice as long. So you are right that installing timescaledb speeds things up, even when not using any timescaledb specific functions. So that was a good call. Thanks!
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145
Cool. I'd be interested to see the explain on it if you ever try it again. On that cardinality, I'd expect it to be really fast, so I'm interested to see if the (SkipScan) nodes were actually used.
On Sep 22, 2021, at 1:50 PM, Ryan Booz <ryan@timescale.com> wrote:
> Cool. I'd be interested to see the explain on it if you ever try it again. On that cardinality, I'd expect it to be really fast, so I'm interested to see if the (SkipScan) nodes were actually used.
With timescaledb extension installed, the explain is what I posted in the original message (https://explain.depesz.com/s/mtxB#html). Without timescaledb installed, the explain looks the same, except it takes twice as long to run.
Unless I missed something in your message, i.e. some sort of tweak to the query to get it to use the timescaledb features?
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145
Ah. I didn't realize that. If SkipScan was chosen, you'd actually see it as one of the execution nodes. I also realize I was making a few assumptions about your data, are channels shared among stations, or are all channels unique (like an ID) per station? That would impact the index and approach.
Something like:

station | channel
--------|--------
   1    |   1
   1    |   2
   2    |   3
   2    |   4

or:

station | channel
--------|--------
   1    |   1
   1    |   2
   2    |   1
   2    |   2
On Sep 22, 2021, at 2:05 PM, Ryan Booz <ryan@timescale.com> wrote:
> Ah. I didn't realize that. If SkipScan was chosen, you'd actually see it as one of the execution nodes. I also realize I was making a few assumptions about your data, are channels shared among stations, or are all channels unique (like an ID) per station? That would impact the index and approach.
Ok, that may be a good point: “channel” is currently a varchar column, containing something like ‘BHZ’, ‘EHZ’, ‘BHE’ etc. There are only a handful of possible channels that I am currently aware of, which are shared among stations - most stations have a ‘BHZ’ channel, for example. That would be fairly simple to normalize out if that would help.
On Wed, Sep 22, 2021 at 2:48 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "David G. Johnston" <david.g.johnston@gmail.com> writes:
>> There is no where clause so I'm doubtful there is much to be gained going
>> down this path. The Index-Only scan seems like an optimal way to obtain
>> this data and the existing query already does that.
>
> The "index-only" scan is reported to do 86m heap fetches along the
> way to returning 812m rows, so the data is apparently pretty dirty.

Do you say that because you would expect many more than 10 tuples per page?
Michael Lewis <mlewis@entrata.com> writes:
> On Wed, Sep 22, 2021 at 2:48 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The "index-only" scan is reported to do 86m heap fetches along the
>> way to returning 812m rows, so the data is apparently pretty dirty.
>
> Do you say that because you would expect many more than 10 tuples per page?

No, I say that because if the table were entirely all-visible, there would have been *zero* heap fetches. As it stands, it's reasonable to suspect that a pretty sizable fraction of the index-only scan's runtime went into random-access heap fetches made to verify visibility of individual rows.

(You will, of course, never get to exactly zero heap fetches in an IOS unless the table data is quite static. But one dirty page out of every ten seems like there were a lot of recent changes. A VACUUM to clean that up might be well worthwhile.)

regards, tom lane
On Thu, 23 Sept 2021 at 08:21, Michael Lewis <mlewis@entrata.com> wrote:
> select station, array_agg(distinct(channel)) as channels
> FROM(
> SELECT station,channel FROM data GROUP BY station,channel
> ) AS sub
> group by station;

Since the subquery is grouping by station, channel, there's no need for the DISTINCT in the aggregate function. Removing that should remove some tuplestore overhead from the aggregate node.

David
On Thu, 23 Sept 2021 at 08:27, David G. Johnston <david.g.johnston@gmail.com> wrote:
> On Wed, Sep 22, 2021 at 1:21 PM Michael Lewis <mlewis@entrata.com> wrote:
>> If there is correlation between station & channel, then you might look at creating a multivariate statistics object and analyzing the table so the planner can make better choices
>
> There is no where clause so I'm doubtful there is much to be gained going down this path. The Index-Only scan seems like an optimal way to obtain this data and the existing query already does that. The aggregation path might vary though it seems like that shouldn't be the case here.

ndistinct extended statistics would be used to estimate the number of groups in the GROUP BY clause for the version of the query that performs GROUP BY station, channel. We've not seen the EXPLAIN ANALYZE for that query, so don't know if there's any use for extended statistics there. However, if the planner was to think there were more groups than there actually are, then it would be less inclined to parallelise the GROUP BY. I think writing the query in such a way that allows it to be parallelised is likely going to result in some quite good performance improvements, i.e.:

select station, array_agg(channel) as channels
FROM(
SELECT station,channel FROM data GROUP BY station,channel
) AS sub
group by station;

which is just the same as Michael's version but without DISTINCT.

Also, Tom's point about lots of heap fetches is going to count for quite a bit too, especially so if I/O plays a large part in the total query time.

David
On Sep 22, 2021, at 5:10 PM, David Rowley <dgrowleyml@gmail.com> wrote:
> I think writing the query in such a way that allows it to be
> parallelised is likely going to result in some quite good
> performance improvements, i.e.:
>
> select station, array_agg(channel) as channels
> FROM(
> SELECT station,channel FROM data GROUP BY station,channel
> ) AS sub
> group by station;
Ah, yes indeed. That version runs in about 30 seconds rather than 5 minutes! See the explain analyze output here: https://explain.depesz.com/s/L5Bf It looks more complicated, but being able to run parallel definitely makes a difference, and there may be some other improvements in there that I’m not aware of as well!
Still not quite fast enough for real-time queries, but certainly fast enough to keep a materialized view updated.
And this is why I love postgresql and this community - when something isn’t working as well as I would like, there is usually a way to improve it drastically :-)
Thanks!
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145
On Wednesday, September 22, 2021, David Rowley <dgrowleyml@gmail.com> wrote:
> I think writing the query in such a way
> that allows it to be parallelised is likely going to result in some
> quite good performance improvements, i.e.:
Agreed, though if the query author needs to do that here we’ve violated the spirit of the declarative SQL language. At first blush nothing about the original query seems like it should be preventing parallelism. Each worker builds its own distinct array then the final concatenation is made distinct.
David J.
On Thu, 23 Sept 2021 at 13:28, David G. Johnston <david.g.johnston@gmail.com> wrote:
> Agreed, though if the query author needs to do that here we’ve violated the spirit of the declarative SQL language. At first blush nothing about the original query seems like it should be preventing parallelism. Each worker builds its own distinct array then the final concatenation is made distinct.

We don't parallelise DISTINCT / ORDER BY aggregates.

David
On Thu, 23 Sept 2021 at 13:21, Israel Brewster <ijbrewster@alaska.edu> wrote:
> Ah, yes indeed. That version runs in about 30 seconds rather than 5 minutes! See the explain analyze output here: https://explain.depesz.com/s/L5Bf It looks more complicated, but being able to run parallel definitely makes a difference, and there may be some other improvements in there that I’m not aware of as well!

That's good. You should also look into the VACUUM thing mentioned by Tom.

If this table is just receiving INSERTs and not UPDATE/DELETEs then you might want to consider tweaking the auto-vacuum settings for it. The default autovacuum_vacuum_insert_scale_factor will mean that auto-vacuum will only kick off a worker to vacuum this table when 20% of the total rows have been inserted since the last vacuum. It's possible that might account for your large number of heap fetches.

If the table is insert-only, then you could drop the autovacuum_vacuum_insert_scale_factor down a bit. In the command below, I set it to 2%. Also dropping the autovacuum_freeze_min_age is a pretty good thing to do for tables that are never or almost never UPDATEd or DELETEd from.

alter table data set (autovacuum_vacuum_insert_scale_factor=0.02, autovacuum_freeze_min_age=0);

Vacuuming an insert-only table more often is not a great deal of extra work, and it's possibly even less work if you were to vacuum before recently inserted pages got evicted from shared_buffers or the kernel's buffers. The already vacuumed and frozen portion of the table will be skipped using the visibility and freeze map, which is very fast to do.

David
Sorry - break for dinner! So much happens on a global scale in a few hours. :-)!
I took a few minutes and created a simple example here of what I imagine you have on that table. I only inserted ~80 million rows of test data, but hopefully, it's somewhat representative.
TimescaleDB's current implementation of SkipScan only allows distinct on one column, and because of where we can place the hook to read the query cost, a LATERAL JOIN (or similar) can't be used to get both columns like you want. So, to outsmart the planner, you can get your results with one of the DISTINCT queries in a function. I realize this is getting a bit specific, so it might not be an exact fit for you, but this query comes back on my 78 million rows in 67ms. YMMV
Step 1: Create a function that returns the array of channels per station
CREATE FUNCTION channel_array(TEXT) RETURNS text[]
AS $$
SELECT array_agg(channel) FROM (
select distinct on (channel) channel from stations where station=$1
) a
$$
LANGUAGE SQL;
Step 2: Use a CTE for the distinct stations, querying the function for each station
WITH s1 AS (
SELECT DISTINCT ON (station) station FROM stations
)
SELECT station, channel_array(station) channel
FROM s1;
If it's using the index, you should see something like:
Subquery Scan on s1 (cost=0.57..16.22 rows=19 width=34) (actual time=0.580..4.809 rows=19 loops=1)
-> Unique (cost=0.57..11.28 rows=19 width=2) (actual time=0.043..0.654 rows=19 loops=1)
-> Custom Scan (SkipScan) on stations (cost=0.57..11.23 rows=19 width=2) (actual time=0.042..0.647 rows=19 loops=1)
-> Index Only Scan using idx_station_channel on stations (cost=0.57..1807691.34 rows=76000032 width=2) (actual time=0.040..0.641 rows=19 loops=1)
Index Cond: (station > NULL::text)
Heap Fetches: 19
HTH,
Ryan
On 9/22/21 16:20, David G. Johnston wrote:
> I'd probably turn that index into a foreign key that just ensures that
> every (station,channel) that appears in the data table also appears on
> the lookup table. Grouping and array-ifying the lookup table would be
> trivial. Either modify the application code or add a trigger to
> populate the lookup table as needed.

I fully agree with this. Adding a trigger to populate a lookup table is a standard design in situations like this. Using "DISTINCT" almost always spells trouble for the performance.

--
Mladen Gogala
Database Consultant
Tel: (347) 321-1217
https://dbwhisperer.wordpress.com
On Wed, Sep 22, 2021 at 12:05:22PM -0800, Israel Brewster wrote:
> I was wondering if there was any way to improve the performance of this query:
>
> SELECT station,array_agg(distinct(channel)) as channels FROM data GROUP BY station;
>
> The explain execution plan can be found here:
> https://explain.depesz.com/s/mtxB#html
>
> This is running PostgreSQL 13, and the index referenced is a two-column index on data(station, channel)

It looks like there are ~170 stations, and ~800 million rows in the table.

Can you tell us how many rows this has:

select distinct station, channel from data;

If this is not huge, then you can make the query run much faster using skip scan - recursive cte.

Best regards,

depesz
Heh, I honestly forgot about the recursive CTE. Certainly worth a try and wouldn't require installing other extensions.
This is what depesz is referring to: https://wiki.postgresql.org/wiki/Loose_indexscan
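[For reference, a minimal sketch of that technique adapted to this thread's query - a recursive CTE that leapfrogs from one distinct (station, channel) pair to the next via the existing two-column index. Untested against the real schema:]

WITH RECURSIVE pairs AS (
    -- the first (station, channel) pair in index order
    (SELECT station, channel FROM data
     ORDER BY station, channel LIMIT 1)
    UNION ALL
    -- jump directly to the next distinct pair using the index;
    -- recursion stops when the LATERAL subquery returns no row
    SELECT nxt.station, nxt.channel
    FROM pairs p
    CROSS JOIN LATERAL (
        SELECT station, channel FROM data
        WHERE (station, channel) > (p.station, p.channel)
        ORDER BY station, channel LIMIT 1
    ) AS nxt
)
SELECT station, array_agg(channel) AS channels
FROM pairs
GROUP BY station;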
> On Sep 22, 2021, at 11:04 PM, hubert depesz lubaczewski <depesz@depesz.com> wrote:
>
> Can you tell us how many rows this has:
>
> select distinct station, channel from data;

At the moment, about 170, but I would expect it to stabilize at around 510 or less once I am pulling in all the channels. Getting this query (or the stored/cached results thereof, as it shouldn’t change too often) working fast enough to be used in the live system is simply the first step to pulling in three times as much data (that’ll be fun!)

> If this is not huge, then you can make the query run much faster using
> skip scan - recursive cte.

Sounds like something to look into. Of course, if I go with a lookup table, updated by an on insert trigger, it becomes a moot point. I’ll have to spend some time wrapping my head around the concept, and figuring out how to write it so that I get distinct per station rather than just a straight up distinct, but theoretically at least it makes sense. Thanks!

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145
On Sep 23, 2021, at 4:34 AM, Ryan Booz <ryan@timescale.com> wrote:
> Heh, I honestly forgot about the recursive CTE. Certainly worth a try and wouldn't require installing other extensions. This is what depesz is referring to: https://wiki.postgresql.org/wiki/Loose_indexscan
Thanks for the pointer. Will definitely have to spend some time wrapping my brain around that one - I’ve done some CTE’s before, but not recursive that I can recall. Should be fun!
If it helps matters any, my structure is currently the following:
table “stations” listing station details (name, latitude, longitude, etc) with a smallint primary key “id”
table “data” with many (many!) data columns (mostly doubles), a station column that is a smallint referencing the stations table, and a channel column which is a varchar containing the *name* of the channel the data came in on.
I will readily accept that this may not be the best structure for the DB. For example, perhaps the channel column should be normalized out as has been mentioned a couple of times as an option. This would make sense, and would certainly simplify this portion of the project.
If I do go with a lookup table updated by a trigger, what would be the best option for the query the trigger runs - an upsert (ON CONFLICT DO NOTHING)? Or a query followed by an insert if needed? The normal case would be that the entry already exists (millions of hits vs only the occasional insert needed).
Thanks again for all the suggestions!
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145
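[For what it's worth, a minimal sketch of the upsert-in-a-trigger variant asked about above. The function and trigger names are hypothetical, and station_channels is the hypothetical lookup table sketched earlier in the thread:]

CREATE FUNCTION note_station_channel() RETURNS trigger AS $$
BEGIN
    -- ON CONFLICT DO NOTHING makes this a cheap no-op in the common
    -- case where the (station, channel) pair already exists.
    INSERT INTO station_channels (station, channel)
    VALUES (NEW.station, NEW.channel)
    ON CONFLICT DO NOTHING;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- BEFORE INSERT, so the lookup row exists by the time any foreign-key
-- constraint on data is checked.
CREATE TRIGGER data_note_station_channel
    BEFORE INSERT ON data
    FOR EACH ROW EXECUTE FUNCTION note_station_channel();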
On 9/23/21 10:16 AM, Israel Brewster wrote:
> If I do go with a lookup table updated by a trigger, what would be the best option for the query the trigger runs - an upsert (ON CONFLICT DO NOTHING)? Or a query followed by an insert if needed? The normal case would be that the entry already exists (millions of hits vs only the occasional insert needed).

I would look into pre-loading the lookup table (and pre-emptive maintenance). Add the foreign key, but not the trigger.
On Sep 23, 2021, at 8:33 AM, Rob Sargent <robjsargent@gmail.com> wrote:On 9/23/21 10:16 AM, Israel Brewster wrote:I would look into pre-loading the lookup table (and pre-emptive maintenance). Add the foreign key, but not the trigger.On Sep 23, 2021, at 4:34 AM, Ryan Booz <ryan@timescale.com> wrote:Heh, I honestly forgot about the recursive CTE. Certainly worth a try and wouldn't require installing other extensions.This is what depesz is referring to: https://wiki.postgresql.org/wiki/Loose_indexscanThanks for the pointer. Will definitely have to spend some time wrapping my brain around that one - I’ve done some CTE’s before, but not recursive that I can recall. Should be fun!If it helps matters any, my structure is currently the following:table “stations” listing station details (name, latitude, longitude, etc) with a smallint primary key “id"table “data” with many (many!) data columns (mostly doubles), a station column that is a smallint referencing the stations table, and a channel column which is a varchar containing the *name* of the channel the data came in on.I will readily accept that this may not be the best structure for the DB. For example, perhaps the channel column should be normalized out as has been mentioned a couple of times as an option. This would make sense, and would certainly simplify this portion of the project.If I do go with a lookup table updated by a trigger, what would be the best option for the query the trigger runs - an upset (ON CONFLICT DO NOTHING)? Or a query followed by an insert if needed? The normal case would be that the entry already exists (millions of hits vs only the occasional insert needed).
That makes sense. Thanks!
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145
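The foreign-key half of that suggestion might look like this sketch (constraint name hypothetical; it assumes the station_channels lookup table sketched earlier, fully populated before the constraint is added, since validation will fail otherwise):

-- Require every (station, channel) pair in data to exist in the
-- lookup table; the lookup table's primary key satisfies the
-- uniqueness requirement for the referenced columns.
ALTER TABLE data
    ADD CONSTRAINT data_station_channel_fkey
    FOREIGN KEY (station, channel)
    REFERENCES station_channels (station, channel);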
> > I would look into pre-loading the lookup table (and pre-emptive maintenance). Add the foreign key, but not the trigger.
> That makes sense. Thanks!

Yeah, then I got to wondering: do you care? Are these stations likely to be spoofed? You have the station id and type in your data table and essentially the same in your lookup table. If you're not replacing the id+type in your data table with a lookup id, you really don't need to even have a foreign key. Maybe sync them regularly, but I'm not seeing the value in the runtime overhead. Now, presumably the station table is entirely pinned in memory and the foreign-key check might not be much overhead, but it won't be zero.
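The "sync them regularly" alternative Rob mentions could be a scheduled statement along these lines (again assuming the hypothetical station_channels table):

-- Refresh the lookup table from whatever has appeared in data;
-- existing pairs are skipped, new ones are added.
INSERT INTO station_channels (station, channel)
SELECT DISTINCT station, channel
FROM data
ON CONFLICT DO NOTHING;

Note that the SELECT DISTINCT re-scans the data table, so this trades per-insert trigger overhead for a periodic batch cost, much like refreshing the materialized view.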
On Wed, 22 Sept 2021 at 21:05, Israel Brewster <ijbrewster@alaska.edu> wrote:
> I was wondering if there was any way to improve the performance of this query:
> SELECT station,array_agg(distinct(channel)) as channels FROM data GROUP BY station;
If you have tables of possible stations and channels (and if not, why not?), then an EXISTS query, something like
SELECT stations.name, ARRAY_AGG(channels.name)
FROM stations, channels
WHERE EXISTS
(SELECT FROM data WHERE data.channels=channels.name AND data.station=stations.name)
GROUP BY stations.name
will usually be much faster, because it can stop scanning after the first match in the index.
Geoff
On Sep 23, 2021, at 10:36 AM, Geoff Winkless <pgsqladmin@geoff.dj> wrote:
> If you have tables of possible stations and channels (and if not, why not?), then an EXISTS query, something like
>
> SELECT stations.name, ARRAY_AGG(channels.name)
> FROM stations, channels
> WHERE EXISTS
> (SELECT FROM data WHERE data.channels=channels.name AND data.station=stations.name)
> GROUP BY stations.name
>
> will usually be much faster, because it can stop scanning after the first match in the index.
It’s making me think though, because this similar query (though incomplete, and therefore giving an incorrect result) runs in only 19ms (https://explain.depesz.com/s/iZnN):
SELECT
stations.name,
array_agg(channels.channel)
FROM stations,channels
WHERE EXISTS (SELECT
FROM data
WHERE data.station=stations.id)
GROUP BY stations.name
It’s only when I add in the AND data.channels=channels.channel that the query time blows up to 5+ minutes. I personally don’t understand why there would be such a large difference between the two queries - something wrong with my indexes?
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145
> It’s only when I add in the AND data.channels=channels.channel that the query time blows up to 5+ minutes. I personally don’t understand why there would be such a large difference between the two queries - something wrong with my indexes?
Did you do the manual vacuum as suggested by Tom? Maybe at least check pg_stat_user_tables for the last vacuum or autovacuum timestamp, and how many records have changed since then. The system is deciding on an index-only scan because that should be very fast, but if it needs to look at the actual table to determine whether the process executing the query should still be allowed to see a tuple (row version), then the index-only scan slows down a lot, I believe. The huge number of heap fetches that Tom pointed out means that the system doesn't know that ALL processes should be able to see those tuples. As someone else suggested lowering the max freeze age, you might just do a manual "vacuum freeze" as well, to allow future autovacuum processes to quickly skip over those pages.
I've heard of the loose indexscan before mentioned on this thread, but I'm not seeing how to implement that for multiple columns. Anyone have an example or pseudo code perhaps?
On Sep 23, 2021, at 8:55 PM, Michael Lewis <mlewis@entrata.com> wrote:
> > It’s only when I add in the AND data.channels=channels.channel that the query time blows up to 5+ minutes. I personally don’t understand why there would be such a large difference between the two queries - something wrong with my indexes?
>
> Did you do the manual vacuum as suggested by Tom?
I ran a VACUUM ANALYZE, yes.
> Maybe at least check pg_stat_user_tables for the last vacuum or autovacuum timestamp, and how many records have changed since then.
volcano_seismology=# SELECT seq_scan,seq_tup_read,idx_scan,idx_tup_fetch,n_tup_ins,n_tup_upd,n_tup_del,n_tup_hot_upd,n_live_tup,n_dead_tup,n_mod_since_analyze,n_ins_since_vacuum,last_vacuum,last_autovacuum,last_analyze,last_autoanalyze,vacuum_count,autovacuum_count,analyze_count,autoanalyze_count FROM pg_stat_user_tables WHERE relname='data';
-[ RECORD 1 ]-------+------------------------------
seq_scan | 95
seq_tup_read | 25899340540
idx_scan | 728372
idx_tup_fetch | 51600217033
n_tup_ins | 840283699
n_tup_upd | 66120702
n_tup_del | 2375651
n_tup_hot_upd | 0
n_live_tup | 839266956
n_dead_tup | 66585751
n_mod_since_analyze | 58896052
n_ins_since_vacuum | 24890460
last_vacuum | 2021-09-22 21:32:11.367855+00
last_autovacuum | 2021-09-14 07:13:23.745862+00
last_analyze | 2021-09-22 21:32:21.071092+00
last_autoanalyze | 2021-09-21 11:54:36.924762+00
vacuum_count | 1
autovacuum_count | 1
analyze_count | 1
autoanalyze_count | 2
Note that the update count was due to a (hopefully) one-time process where I had to change the value of a bunch of records. Generally this *should be* an insert-once-read-many database.
> The system is deciding on an index-only scan because that should be very fast, but if it needs to look at the actual table to determine whether the process executing the query should still be allowed to see a tuple (row version), then the index-only scan slows down a lot, I believe. The huge number of heap fetches that Tom pointed out means that the system doesn't know that ALL processes should be able to see those tuples. As someone else suggested lowering the max freeze age, you might just do a manual "vacuum freeze" as well, to allow future autovacuum processes to quickly skip over those pages.
OK, I ran a VACUUM (ANALYZE, FREEZE), and am now showing this from pg_stat_user_tables:
volcano_seismology=# SELECT seq_scan,seq_tup_read,idx_scan,idx_tup_fetch,n_tup_ins,n_tup_upd,n_tup_del,n_tup_hot_upd,n_live_tup,n_dead_tup,n_mod_since_analyze,n_ins_since_vacuum,last_vacuum,last_autovacuum,last_analyze,last_autoanalyze,vacuum_count,autovacuum_count,analyze_count,autoanalyze_count FROM pg_stat_user_tables WHERE relname='data';
-[ RECORD 1 ]-------+------------------------------
seq_scan | 96
seq_tup_read | 26737263238
idx_scan | 732396
idx_tup_fetch | 52571927369
n_tup_ins | 841017819
n_tup_upd | 66120702
n_tup_del | 2388723
n_tup_hot_upd | 0
n_live_tup | 840198830
n_dead_tup | 10173
n_mod_since_analyze | 0
n_ins_since_vacuum | 0
last_vacuum | 2021-09-24 17:18:18.34282+00
last_autovacuum | 2021-09-14 07:13:23.745862+00
last_analyze | 2021-09-24 17:18:31.576238+00
last_autoanalyze | 2021-09-21 11:54:36.924762+00
vacuum_count | 2
autovacuum_count | 1
analyze_count | 2
autoanalyze_count | 2
However, adding the AND data.channels=channels.channel to the query still makes it take around 5 minutes (https://explain.depesz.com/s/7hb1). So, again, running VACUUM didn’t appear to help any.
Also perhaps interestingly, if I again modify the query to only match on channel, not station, it is again fast (though not quite as fast): https://explain.depesz.com/s/HLb8
So, basically, I can quickly get a list of all channels for which I have data, or all stations for which I have data, but getting a list of all channels for each station is slow.
> I've heard of the loose indexscan before mentioned on this thread, but I'm not seeing how to implement that for multiple columns. Anyone have an example or pseudo code perhaps?
This is my stumbling block to implementing this option as well. That said, with the lookup table in place, perhaps all the rest of this is a moot point?
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145
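One possible answer to the multiple-column question, sketched under the assumption of the existing two-column index on data(station, channel), extends the wiki's single-column pattern with a LATERAL subquery that skips to the next distinct (station, channel) pair:

-- Loose index scan over (station, channel): each recursive step is
-- one index probe to the next distinct pair, so the work scales with
-- the number of distinct pairs (a few hundred), not the row count
-- (hundreds of millions).
WITH RECURSIVE pairs AS (
    -- Anchor: the smallest pair
    (SELECT station, channel FROM data ORDER BY station, channel LIMIT 1)
    UNION ALL
    SELECT n.station, n.channel
    FROM pairs p
    CROSS JOIN LATERAL (
        -- Row-value comparison lets the btree index find the next
        -- pair strictly beyond the current one in a single probe.
        SELECT d.station, d.channel
        FROM data d
        WHERE (d.station, d.channel) > (p.station, p.channel)
        ORDER BY d.station, d.channel
        LIMIT 1
    ) n
)
SELECT station, array_agg(channel) AS channels
FROM pairs
GROUP BY station;

The recursion terminates when the LATERAL subquery finds no pair beyond the current one, and the final GROUP BY aggregates the handful of surviving rows, which should give the original station/channels result in milliseconds rather than minutes.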