Thread: Faster distinct query?
I was wondering if there was any way to improve the performance of this query:
SELECT station,array_agg(distinct(channel)) as channels FROM data GROUP BY station;
The explain execution plan can be found here: https://explain.depesz.com/s/mtxB#html
and it looks pretty straightforward. It does an index-only scan, followed by an aggregate, to produce a result that is a list of stations along with a list of channels associated with each (there can be anywhere from 1 to 3 channels associated with each station). This query takes around 5 minutes to run.
To work around the issue, I created a materialized view that I can update periodically, and of course I can query said view in no time flat. However, I’m concerned that as the dataset grows, the time it takes to refresh the view will also grow (correct me if I am wrong there).
This is running PostgreSQL 13, and the index referenced is a two-column index on data(station, channel)
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145
On Wed, Sep 22, 2021 at 1:05 PM Israel Brewster <ijbrewster@alaska.edu> wrote:
> To work around the issue, I created a materialized view that I can update periodically, and of course I can query said view in no time flat. However, I’m concerned that as the dataset grows, the time it takes to refresh the view will also grow (correct me if I am wrong there).
I'd probably turn that index into a foreign key that just ensures that every (station,channel) that appears in the data table also appears on the lookup table. Grouping and array-ifying the lookup table would be trivial. Either modify the application code or add a trigger to populate the lookup table as needed.
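[A minimal sketch of that arrangement, for reference. The table and constraint names are hypothetical, and the column types are assumed (the thread later confirms station is a smallint and channel a varchar):]

-- Hypothetical lookup table holding each (station, channel) pair exactly once.
CREATE TABLE station_channels (
    station smallint NOT NULL,
    channel varchar NOT NULL,
    PRIMARY KEY (station, channel)
);

-- One-time seeding from the existing data table (slow, but only done once).
INSERT INTO station_channels
SELECT station, channel FROM data GROUP BY station, channel;

-- The foreign key then guarantees every pair appearing in data
-- also appears in the lookup table.
ALTER TABLE data
    ADD FOREIGN KEY (station, channel)
    REFERENCES station_channels (station, channel);

-- Grouping and array-ifying the lookup table is trivial:
SELECT station, array_agg(channel) AS channels
FROM station_channels
GROUP BY station;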
The parentheses around channel in "array_agg(distinct(channel))" are unnecessary - you are invoking composite-type syntax, which is ignored in the single column case unless you write the optional ROW keyword, i.e., distinct ROW(channel)
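[That is, the original query can be spelled more plainly as:]

SELECT station, array_agg(DISTINCT channel) AS channels
FROM data
GROUP BY station;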
David J.
In the future, please share the plan returned by explain analyze, and some data about how many rows in the involved tables, what type of system you are running it on, any changes from default configs, etc.
How does the below work? It should do a very simple index scan only, then aggregate the relatively few rows after the fact.
select station, array_agg(distinct(channel)) as channels
FROM(
SELECT station,channel FROM data GROUP BY station,channel
) AS sub
group by station;
If there is correlation between station & channel, then you might look at creating a multivariate statistics object and analyzing the table so the planner can make better choices, knowing that channel is dependent on station perhaps. I wouldn't necessarily think that it would help this query, but perhaps others. Also, you might try creating only dependencies, only ndistinct type, or some combination other than all 3 types.
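[For reference, a sketch of what creating such a statistics object might look like; the object name is arbitrary:]

CREATE STATISTICS data_station_channel_stats (ndistinct, dependencies)
    ON station, channel FROM data;
-- Re-analyze so the planner picks up the new statistics:
ANALYZE data;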
On Wed, Sep 22, 2021 at 1:21 PM Michael Lewis <mlewis@entrata.com> wrote:
> In the future, please share the plan returned by explain analyze, and some data about how many rows in the involved tables,
I believe we consider it acceptable to link to an explain viewer, which is what the OP did. Reading explain output in email has its own challenges, and I'd rather have the website than a text attachment.
> How does the below work? It should do a very simple index scan only, then aggregate the relatively few rows after the fact.
>
> select station, array_agg(distinct(channel)) as channels
> FROM (SELECT station, channel FROM data GROUP BY station, channel) AS sub
> group by station;
Yeah, am pondering this too, though seems like the queries should be identical so the plan/execution should be the same either way.
> If there is correlation between station & channel, then you might look at creating a multivariate statistics object and analyzing the table so the planner can make better choices
There is no where clause so I'm doubtful there is much to be gained going down this path. The Index-Only scan seems like an optimal way to obtain this data and the existing query already does that. The aggregation path might vary though it seems like that shouldn't be the case here.
David J.
On Sep 22, 2021, at 12:20 PM, David G. Johnston <david.g.johnston@gmail.com> wrote:
> I'd probably turn that index into a foreign key that just ensures that every (station,channel) that appears in the data table also appears on the lookup table. Grouping and array-ifying the lookup table would be trivial. Either modify the application code or add a trigger to populate the lookup table as needed.
Makes sense. I was actually considering this approach (albeit without the foreign key - that’s a nice additional safety measure), but was concerned about the overhead that adding said trigger would have on inserts - thus my thought to try the materialized view. As a reference, this database is receiving 1Hz data from around 170 stations, with up to three channels of data per station. So something like 350-500 inserts per second, although the data is “grouped” into 10 minute batches. I’ll give it another look.
> The parentheses around channel in "array_agg(distinct(channel))" are unnecessary - you are invoking composite-type syntax, which is ignored in the single column case unless you write the optional ROW keyword, i.e., distinct ROW(channel)
Good information, thanks!
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145
"David G. Johnston" <david.g.johnston@gmail.com> writes: > There is no where clause so I'm doubtful there is much to be gained going > down this path. The Index-Only scan seems like an optimal way to obtain > this data and the existing query already does that. The "index-only" scan is reported to do 86m heap fetches along the way to returning 812m rows, so the data is apparently pretty dirty. It's possible that a preliminary VACUUM to get page-all-visible hint bits set would be a net win. regards, tom lane
[Timescale Dev Advocate here]
I realize this might not be the most accepted answer (could be interpreted as trying to "sell" something), but feels like an opportunity to talk about DISTINCT queries and opportunities. Because you have that index, Timescale 2.3 added a "Skip Scan" query planner node that works on regular BTree indexes (it doesn't have to be time-series/TimescaleDB Hypertable data at all). In this case, your distinct query would likely run in a few milliseconds based on the counts you mention (170 stations, 3 channels per station), and then the outer aggregation would do the GROUP BY. So, you **could** add the TimescaleDB extension to your database (or a copy of) and give it a try. You don't actually need to use any TimescaleDB features otherwise.
- A writeup of why this kind of DISTINCT query is slow in PostgreSQL (for now) and what we did to overcome it: https://blog.timescale.com/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/
- Plans for a similar feature in PostgreSQL proper that we'd totally support but hasn't made forward progress yet: https://commitfest.postgresql.org/19/1741/
Anyway, it might be worth a shot. HTH
Ryan B
> On Sep 22, 2021, at 12:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> The "index-only" scan is reported to do 86m heap fetches along the
> way to returning 812m rows, so the data is apparently pretty dirty.
> It's possible that a preliminary VACUUM to get page-all-visible hint
> bits set would be a net win.

I do have autovacuum turned on, but perhaps I need to do a manual one? The initial population of the database was accomplished via logical replication from a different database cluster (needed to move this database to more dedicated hardware), so perhaps that left the database in a state that autovacuum doesn’t address? Or perhaps my autovacuum settings aren’t kosher - I haven’t adjusted that portion of the config any.

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145
On Sep 22, 2021, at 12:49 PM, Ryan Booz <ryan@timescale.com> wrote:
> So, you **could** add the TimescaleDB extension to your database (or a copy of) and give it a try. You don't actually need to use any TimescaleDB features otherwise.
I had actually already done that, as I was considering, in spite of past negative experiences with timescaledb, experimenting with it on this DB to see if it worked any better with this data. Out of curiosity, I tried removing the timescaledb extension, whereupon the query in question took roughly twice as long. So you are right that installing timescaledb speeds things up, even when not using any timescaledb specific functions. So that was a good call. Thanks!
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145
Cool. I'd be interested to see the explain on it if you ever try it again. On that cardinality, I'd expect it to be really fast, so I'm interested to see if the (SkipScan) nodes were actually used.
On Sep 22, 2021, at 1:50 PM, Ryan Booz <ryan@timescale.com> wrote:
> Cool. I'd be interested to see the explain on it if you ever try it again. On that cardinality, I'd expect it to be really fast, so I'm interested to see if the (SkipScan) nodes were actually used.
With timescaledb extension installed, the explain is what I posted in the original message (https://explain.depesz.com/s/mtxB#html). Without timescaledb installed, the explain looks the same, except it takes twice as long to run.
Unless I missed something in your message, i.e. some sort of tweak to the query to get it to use the timescaledb features?
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145
Ah. I didn't realize that. If SkipScan was chosen, you'd actually see it as one of the execution nodes. I also realize I was making a few assumptions about your data, are channels shared among stations, or are all channels unique (like an ID) per station? That would impact the index and approach.
Something like:

station | channel
--------|--------
   1    |   1
   1    |   2
   2    |   3
   2    |   4

or:

station | channel
--------|--------
   1    |   1
   1    |   2
   2    |   1
   2    |   2
On Sep 22, 2021, at 2:05 PM, Ryan Booz <ryan@timescale.com> wrote:
> Ah. I didn't realize that. If SkipScan was chosen, you'd actually see it as one of the execution nodes. I also realize I was making a few assumptions about your data, are channels shared among stations, or are all channels unique (like an ID) per station? That would impact the index and approach.
Ok, that may be a good point: “channel” is currently a varchar column, containing something like ‘BHZ’, ‘EHZ’, ‘BHE’ etc. There are only a handful of possible channels that I am currently aware of, which are shared among stations - most stations have a ‘BHZ’ channel, for example. That would be fairly simple to normalize out if that would help.
On Wed, Sep 22, 2021 at 2:48 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "David G. Johnston" <david.g.johnston@gmail.com> writes:
>> There is no where clause so I'm doubtful there is much to be gained going
>> down this path. The Index-Only scan seems like an optimal way to obtain
>> this data and the existing query already does that.
>
> The "index-only" scan is reported to do 86m heap fetches along the
> way to returning 812m rows, so the data is apparently pretty dirty.

Do you say that because you would expect many more than 10 tuples per page?
Michael Lewis <mlewis@entrata.com> writes:
> On Wed, Sep 22, 2021 at 2:48 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The "index-only" scan is reported to do 86m heap fetches along the
>> way to returning 812m rows, so the data is apparently pretty dirty.
>
> Do you say that because you would expect many more than 10 tuples per page?

No, I say that because if the table were entirely all-visible, there would have been *zero* heap fetches. As it stands, it's reasonable to suspect that a pretty sizable fraction of the index-only scan's runtime went into random-access heap fetches made to verify visibility of individual rows.

(You will, of course, never get to exactly zero heap fetches in an IOS unless the table data is quite static. But one dirty page out of every ten seems like there were a lot of recent changes. A VACUUM to clean that up might be well worthwhile.)

regards, tom lane
On Thu, 23 Sept 2021 at 08:21, Michael Lewis <mlewis@entrata.com> wrote:
> select station, array_agg(distinct(channel)) as channels
> FROM(
> SELECT station,channel FROM data GROUP BY station,channel
> ) AS sub
> group by station;

Since the subquery is grouping by station, channel, there's no need for the DISTINCT in the aggregate function. Removing that should remove some tuplestore overhead from the aggregate node.

David
On Thu, 23 Sept 2021 at 08:27, David G. Johnston <david.g.johnston@gmail.com> wrote:
> On Wed, Sep 22, 2021 at 1:21 PM Michael Lewis <mlewis@entrata.com> wrote:
>> If there is correlation between station & channel, then you might look at creating a multivariate statistics object and analyzing the table so the planner can make better choices
>
> There is no where clause so I'm doubtful there is much to be gained going down this path. The Index-Only scan seems like an optimal way to obtain this data and the existing query already does that. The aggregation path might vary though it seems like that shouldn't be the case here.

ndistinct extended statistics would be used to estimate the number of groups in the GROUP BY clause for the version of the query that performs GROUP BY station, channel. We've not seen the EXPLAIN ANALYZE for that query, so don't know if there's any use for extended statistics there. However, if the planner was to think there were more groups than there actually are, then it would be less inclined to parallelise the GROUP BY. I think writing the query in such a way that allows it to be parallelised is likely going to result in some quite good performance improvements, i.e.:

select station, array_agg(channel) as channels
FROM(
SELECT station,channel FROM data GROUP BY station,channel
) AS sub
group by station;

which is just the same as Michael's version but without DISTINCT.

Also, Tom's point about lots of heap fetches is going to count for quite a bit too, especially so if I/O plays a large part in the total query time.

David
On Sep 22, 2021, at 5:10 PM, David Rowley <dgrowleyml@gmail.com> wrote:
> I think writing the query in such a way that allows it to be
> parallelised is likely going to result in some quite good
> performance improvements, i.e.:
>
> select station, array_agg(channel) as channels
> FROM(
> SELECT station,channel FROM data GROUP BY station,channel
> ) AS sub
> group by station;
Ah, yes indeed. That version runs in about 30 seconds rather than 5 minutes! See the explain analyze output here: https://explain.depesz.com/s/L5Bf It looks more complicated, but being able to run parallel definitely makes a difference, and there may be some other improvements in there that I’m not aware of as well!
Still not quite fast enough for real-time queries, but certainly fast enough to keep a materialized view updated.
And this is why I love postgresql and this community - when something isn’t working as well as I would like, there is usually a way to improve it drastically :-)
Thanks!
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145
On Wednesday, September 22, 2021, David Rowley <dgrowleyml@gmail.com> wrote:
> I think writing the query in such a way
> that allows it to be parallelised is likely going to result in some
> quite good performance improvements, i.e.:
Agreed, though if the query author needs to do that here we’ve violated the spirit of the declarative SQL language. At first blush nothing about the original query seems like it should be preventing parallelism. Each worker builds its own distinct array then the final concatenation is made distinct.
David J.
On Thu, 23 Sept 2021 at 13:28, David G. Johnston <david.g.johnston@gmail.com> wrote:
> Agreed, though if the query author needs to do that here we’ve violated the spirit of the declarative SQL language. At first blush nothing about the original query seems like it should be preventing parallelism. Each worker builds its own distinct array then the final concatenation is made distinct.

We don't parallelise DISTINCT / ORDER BY aggregates.

David
On Thu, 23 Sept 2021 at 13:21, Israel Brewster <ijbrewster@alaska.edu> wrote:
> Ah, yes indeed. That version runs in about 30 seconds rather than 5 minutes! See the explain analyze output here: https://explain.depesz.com/s/L5Bf It looks more complicated, but being able to run parallel definitely makes a difference, and there may be some other improvements in there that I’m not aware of as well!

That's good. You should also look into the VACUUM thing mentioned by Tom.

If this table is just receiving INSERTs and not UPDATE/DELETEs then you might want to consider tweaking the auto-vacuum settings for it. The default autovacuum_vacuum_insert_scale_factor will mean that auto-vacuum will only kick off a worker to vacuum this table when 20% of the total rows have been inserted since the last vacuum. It's possible that might account for your large number of heap fetches.

If the table is insert-only, then you could drop the autovacuum_vacuum_insert_scale_factor down a bit. In the command below, I set it to 2%. Also dropping the autovacuum_freeze_min_age is a pretty good thing to do for tables that are never or almost never UPDATEd or DELETEd from.

alter table data set (autovacuum_vacuum_insert_scale_factor=0.02, autovacuum_freeze_min_age=0);

Vacuuming an insert-only table more often is not a great deal of extra work, and it's possibly even less work if you were to vacuum before recently inserted pages got evicted from shared_buffers or the kernel's buffers. The already vacuumed and frozen portion of the table will be skipped using the visibility and freeze map, which is very fast to do.

David
Sorry - break for dinner! So much happens on a global scale in a few hours. :-)!
I took a few minutes and created a simple example here of what I imagine you have on that table. I only inserted ~80 million rows of test data, but hopefully, it's somewhat representative.
TimescaleDB's current implementation of SkipScan only allows distinct on one column, and because of where we can place the hook to read the query cost, a LATERAL JOIN (or similar) can't be used to get both columns like you want. So, to outsmart the planner, you can get your results with one of the DISTINCT queries in a function. I realize this is getting a bit specific, so it might not be an exact fit for you, but this query comes back on my 78 million rows in 67ms. YMMV
Step 1: Create a function that returns the array of channels per station
CREATE FUNCTION channel_array(TEXT) RETURNS text[]
AS $$
SELECT array_agg(channel) FROM (
select distinct on (channel) channel from stations where station=$1
) a
$$
LANGUAGE SQL;
Step 2: Use a CTE for the distinct stations, querying the function for each station
WITH s1 AS (
SELECT DISTINCT ON (station) station FROM stations
)
SELECT station, channel_array(station) channel
FROM s1;
If it's using the index, you should see something like:
Subquery Scan on s1 (cost=0.57..16.22 rows=19 width=34) (actual time=0.580..4.809 rows=19 loops=1)
-> Unique (cost=0.57..11.28 rows=19 width=2) (actual time=0.043..0.654 rows=19 loops=1)
-> Custom Scan (SkipScan) on stations (cost=0.57..11.23 rows=19 width=2) (actual time=0.042..0.647 rows=19 loops=1)
-> Index Only Scan using idx_station_channel on stations (cost=0.57..1807691.34 rows=76000032 width=2) (actual time=0.040..0.641 rows=19 loops=1)
Index Cond: (station > NULL::text)
Heap Fetches: 19
HTH,
Ryan
On 9/22/21 16:20, David G. Johnston wrote:
> I'd probably turn that index into a foreign key that just ensures that
> every (station,channel) that appears in the data table also appears on
> the lookup table. Grouping and array-ifying the lookup table would be
> trivial. Either modify the application code or add a trigger to
> populate the lookup table as needed.

I fully agree with this. Adding a trigger to populate a lookup table is a standard design in situations like this. Using "DISTINCT" almost always spells trouble for the performance.

--
Mladen Gogala
Database Consultant
Tel: (347) 321-1217
https://dbwhisperer.wordpress.com
On Wed, Sep 22, 2021 at 12:05:22PM -0800, Israel Brewster wrote:
> I was wondering if there was any way to improve the performance of this query:
>
> SELECT station,array_agg(distinct(channel)) as channels FROM data GROUP BY station;
>
> The explain execution plan can be found here:
> https://explain.depesz.com/s/mtxB#html
>
> This is running PostgreSQL 13, and the index referenced is a two-column index on data(station, channel)

It looks like there are ~170 stations, and ~800 million rows in the table.

Can you tell us how many rows this has:

select distinct station, channel from data;

If this is not huge, then you can make the query run much faster using skip scan - recursive cte.

Best regards,

depesz
Heh, I honestly forgot about the recursive CTE. Certainly worth a try and wouldn't require installing other extensions.
This is what depesz is referring to: https://wiki.postgresql.org/wiki/Loose_indexscan
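[For reference, a minimal sketch of that technique adapted to this thread's query - a recursive CTE that leapfrogs from one distinct (station, channel) pair to the next via the existing two-column index. Untested against the real schema:]

WITH RECURSIVE pairs AS (
    -- the first (station, channel) pair in index order
    (SELECT station, channel FROM data
     ORDER BY station, channel LIMIT 1)
    UNION ALL
    -- jump directly to the next distinct pair using the index;
    -- recursion stops when the LATERAL subquery returns no row
    SELECT nxt.station, nxt.channel
    FROM pairs p
    CROSS JOIN LATERAL (
        SELECT station, channel FROM data
        WHERE (station, channel) > (p.station, p.channel)
        ORDER BY station, channel LIMIT 1
    ) AS nxt
)
SELECT station, array_agg(channel) AS channels
FROM pairs
GROUP BY station;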
> On Sep 22, 2021, at 11:04 PM, hubert depesz lubaczewski <depesz@depesz.com> wrote:
>
> Can you tell us how many rows this has:
>
> select distinct station, channel from data;

At the moment, about 170, but I would expect it to stabilize at around 510 or less once I am pulling in all the channels. Getting this query (or the stored/cached results thereof, as it shouldn’t change too often) working fast enough to be used in the live system is simply the first step to pulling in three times as much data (that’ll be fun!)

> If this is not huge, then you can make the query run much faster using
> skip scan - recursive cte.

Sounds like something to look into. Of course, if I go with a lookup table, updated by an on insert trigger, it becomes a moot point. I’ll have to spend some time wrapping my head around the concept, and figuring out how to write it so that I get distinct per station rather than just a straight up distinct, but theoretically at least it makes sense. Thanks!

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145
On Sep 23, 2021, at 4:34 AM, Ryan Booz <ryan@timescale.com> wrote:
> Heh, I honestly forgot about the recursive CTE. Certainly worth a try and wouldn't require installing other extensions. This is what depesz is referring to: https://wiki.postgresql.org/wiki/Loose_indexscan
Thanks for the pointer. Will definitely have to spend some time wrapping my brain around that one - I’ve done some CTE’s before, but not recursive that I can recall. Should be fun!
If it helps matters any, my structure is currently the following:
table “stations” listing station details (name, latitude, longitude, etc) with a smallint primary key “id”
table “data” with many (many!) data columns (mostly doubles), a station column that is a smallint referencing the stations table, and a channel column which is a varchar containing the *name* of the channel the data came in on.
I will readily accept that this may not be the best structure for the DB. For example, perhaps the channel column should be normalized out as has been mentioned a couple of times as an option. This would make sense, and would certainly simplify this portion of the project.
If I do go with a lookup table updated by a trigger, what would be the best option for the query the trigger runs - an upsert (ON CONFLICT DO NOTHING)? Or a query followed by an insert if needed? The normal case would be that the entry already exists (millions of hits vs only the occasional insert needed).
Thanks again for all the suggestions!
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145
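[For what it's worth, a minimal sketch of the upsert-in-a-trigger variant asked about above. The function and trigger names are hypothetical, and station_channels is the hypothetical lookup table sketched earlier in the thread:]

CREATE FUNCTION note_station_channel() RETURNS trigger AS $$
BEGIN
    -- ON CONFLICT DO NOTHING makes this a cheap no-op in the common
    -- case where the (station, channel) pair already exists.
    INSERT INTO station_channels (station, channel)
    VALUES (NEW.station, NEW.channel)
    ON CONFLICT DO NOTHING;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- BEFORE INSERT, so the lookup row exists by the time any foreign-key
-- constraint on data is checked.
CREATE TRIGGER data_note_station_channel
    BEFORE INSERT ON data
    FOR EACH ROW EXECUTE FUNCTION note_station_channel();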
On 9/23/21 10:16 AM, Israel Brewster wrote:
> If I do go with a lookup table updated by a trigger, what would be the best option for the query the trigger runs - an upsert (ON CONFLICT DO NOTHING)? Or a query followed by an insert if needed? The normal case would be that the entry already exists (millions of hits vs only the occasional insert needed).

I would look into pre-loading the lookup table (and pre-emptive maintenance). Add the foreign key, but not the trigger.
On Sep 23, 2021, at 8:33 AM, Rob Sargent <robjsargent@gmail.com> wrote:On 9/23/21 10:16 AM, Israel Brewster wrote:I would look into pre-loading the lookup table (and pre-emptive maintenance). Add the foreign key, but not the trigger.On Sep 23, 2021, at 4:34 AM, Ryan Booz <ryan@timescale.com> wrote:Heh, I honestly forgot about the recursive CTE. Certainly worth a try and wouldn't require installing other extensions.This is what depesz is referring to: https://wiki.postgresql.org/wiki/Loose_indexscanThanks for the pointer. Will definitely have to spend some time wrapping my brain around that one - I’ve done some CTE’s before, but not recursive that I can recall. Should be fun!If it helps matters any, my structure is currently the following:table “stations” listing station details (name, latitude, longitude, etc) with a smallint primary key “id"table “data” with many (many!) data columns (mostly doubles), a station column that is a smallint referencing the stations table, and a channel column which is a varchar containing the *name* of the channel the data came in on.I will readily accept that this may not be the best structure for the DB. For example, perhaps the channel column should be normalized out as has been mentioned a couple of times as an option. This would make sense, and would certainly simplify this portion of the project.If I do go with a lookup table updated by a trigger, what would be the best option for the query the trigger runs - an upset (ON CONFLICT DO NOTHING)? Or a query followed by an insert if needed? The normal case would be that the entry already exists (millions of hits vs only the occasional insert needed).
That makes sense. Thanks!
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145
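The foreign-key half of that suggestion might look like this sketch (constraint name hypothetical; it assumes the station_channels lookup table sketched earlier, fully populated before the constraint is added, since validation will fail otherwise):

-- Require every (station, channel) pair in data to exist in the
-- lookup table; the lookup table's primary key satisfies the
-- uniqueness requirement for the referenced columns.
ALTER TABLE data
    ADD CONSTRAINT data_station_channel_fkey
    FOREIGN KEY (station, channel)
    REFERENCES station_channels (station, channel);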
> > I would look into pre-loading the lookup table (and pre-emptive maintenance). Add the foreign key, but not the trigger.
> That makes sense. Thanks!

Yeah, then I got to wondering: do you care? Are these stations likely to be spoofed? You have the station id and type in your data table and essentially the same in your lookup table. If you're not replacing the id+type in your data table with a lookup id, you really don't need to even have a foreign key. Maybe sync them regularly, but I'm not seeing the value in the runtime overhead. Now, presumably the station table is entirely pinned in memory and the foreign-key check might not be much overhead, but it won't be zero.
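The "sync them regularly" alternative Rob mentions could be a scheduled statement along these lines (again assuming the hypothetical station_channels table):

-- Refresh the lookup table from whatever has appeared in data;
-- existing pairs are skipped, new ones are added.
INSERT INTO station_channels (station, channel)
SELECT DISTINCT station, channel
FROM data
ON CONFLICT DO NOTHING;

Note that the SELECT DISTINCT re-scans the data table, so this trades per-insert trigger overhead for a periodic batch cost, much like refreshing the materialized view.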
On Wed, 22 Sept 2021 at 21:05, Israel Brewster <ijbrewster@alaska.edu> wrote:
> I was wondering if there was any way to improve the performance of this query:
> SELECT station,array_agg(distinct(channel)) as channels FROM data GROUP BY station;
If you have tables of possible stations and channels (and if not, why not?), then an EXISTS query, something like
SELECT stations.name, ARRAY_AGG(channels.name)
FROM stations, channels
WHERE EXISTS
(SELECT FROM data WHERE data.channels=channels.name AND data.station=stations.name)
GROUP BY stations.name
will usually be much faster, because it can stop scanning after the first match in the index.
Geoff
On Sep 23, 2021, at 10:36 AM, Geoff Winkless <pgsqladmin@geoff.dj> wrote:
> If you have tables of possible stations and channels (and if not, why not?), then an EXISTS query, something like
>
> SELECT stations.name, ARRAY_AGG(channels.name)
> FROM stations, channels
> WHERE EXISTS
> (SELECT FROM data WHERE data.channels=channels.name AND data.station=stations.name)
> GROUP BY stations.name
>
> will usually be much faster, because it can stop scanning after the first match in the index.
It’s making me think though, because this similar query (though incomplete, and therefore giving an incorrect result) runs in only 19ms (https://explain.depesz.com/s/iZnN):
SELECT
stations.name,
array_agg(channels.channel)
FROM stations,channels
WHERE EXISTS (SELECT
FROM data
WHERE data.station=stations.id)
GROUP BY stations.name
It’s only when I add in the AND data.channels=channels.channel that the query time blows up to 5+ minutes. I personally don’t understand why there would be such a large difference between the two queries - something wrong with my indexes?
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145
> It’s only when I add in the AND data.channels=channels.channel that the query time blows up to 5+ minutes. I personally don’t understand why there would be such a large difference between the two queries - something wrong with my indexes?
Did you do the manual vacuum as suggested by Tom? Maybe at least check pg_stat_user_tables for the last vacuum or autovacuum timestamp, and how many records have changed since then. The system is deciding on an index-only scan because that should be very fast, but if it needs to look at the actual table to determine whether the process executing the query should still be allowed to see a tuple (row version), then the index-only scan slows down a lot, I believe. The huge number of heap fetches that Tom pointed out means that the system doesn't know that ALL processes should be able to see those tuples. As someone else suggested lowering the max freeze age, you might just do a manual "vacuum freeze" as well, to allow future autovacuum processes to quickly skip over those pages.
I've heard of the loose indexscan before mentioned on this thread, but I'm not seeing how to implement that for multiple columns. Anyone have an example or pseudo code perhaps?
On Sep 23, 2021, at 8:55 PM, Michael Lewis <mlewis@entrata.com> wrote:
> > It’s only when I add in the AND data.channels=channels.channel that the query time blows up to 5+ minutes. I personally don’t understand why there would be such a large difference between the two queries - something wrong with my indexes?
>
> Did you do the manual vacuum as suggested by Tom?
I ran a VACUUM ANALYZE, yes.
> Maybe at least check pg_stat_user_tables for the last vacuum or autovacuum timestamp, and how many records have changed since then.
volcano_seismology=# SELECT seq_scan,seq_tup_read,idx_scan,idx_tup_fetch,n_tup_ins,n_tup_upd,n_tup_del,n_tup_hot_upd,n_live_tup,n_dead_tup,n_mod_since_analyze,n_ins_since_vacuum,last_vacuum,last_autovacuum,last_analyze,last_autoanalyze,vacuum_count,autovacuum_count,analyze_count,autoanalyze_count FROM pg_stat_user_tables WHERE relname='data';
-[ RECORD 1 ]-------+------------------------------
seq_scan | 95
seq_tup_read | 25899340540
idx_scan | 728372
idx_tup_fetch | 51600217033
n_tup_ins | 840283699
n_tup_upd | 66120702
n_tup_del | 2375651
n_tup_hot_upd | 0
n_live_tup | 839266956
n_dead_tup | 66585751
n_mod_since_analyze | 58896052
n_ins_since_vacuum | 24890460
last_vacuum | 2021-09-22 21:32:11.367855+00
last_autovacuum | 2021-09-14 07:13:23.745862+00
last_analyze | 2021-09-22 21:32:21.071092+00
last_autoanalyze | 2021-09-21 11:54:36.924762+00
vacuum_count | 1
autovacuum_count | 1
analyze_count | 1
autoanalyze_count | 2
Note that the update count was due to a (hopefully) one-time process where I had to change the value of a bunch of records. Generally this *should be* an insert-once-read-many database.
> The system is deciding on an index-only scan because that should be very fast, but if it needs to look at the actual table to determine whether the process executing the query should still be allowed to see a tuple (row version), then the index-only scan slows down a lot, I believe. The huge number of heap fetches that Tom pointed out means that the system doesn't know that ALL processes should be able to see those tuples. As someone else suggested lowering the max freeze age, you might just do a manual "vacuum freeze" as well, to allow future autovacuum processes to quickly skip over those pages.
OK, I ran a VACUUM (ANALYZE, FREEZE), and am now showing this from pg_stat_user_tables:
volcano_seismology=# SELECT seq_scan,seq_tup_read,idx_scan,idx_tup_fetch,n_tup_ins,n_tup_upd,n_tup_del,n_tup_hot_upd,n_live_tup,n_dead_tup,n_mod_since_analyze,n_ins_since_vacuum,last_vacuum,last_autovacuum,last_analyze,last_autoanalyze,vacuum_count,autovacuum_count,analyze_count,autoanalyze_count FROM pg_stat_user_tables WHERE relname='data';
-[ RECORD 1 ]-------+------------------------------
seq_scan | 96
seq_tup_read | 26737263238
idx_scan | 732396
idx_tup_fetch | 52571927369
n_tup_ins | 841017819
n_tup_upd | 66120702
n_tup_del | 2388723
n_tup_hot_upd | 0
n_live_tup | 840198830
n_dead_tup | 10173
n_mod_since_analyze | 0
n_ins_since_vacuum | 0
last_vacuum | 2021-09-24 17:18:18.34282+00
last_autovacuum | 2021-09-14 07:13:23.745862+00
last_analyze | 2021-09-24 17:18:31.576238+00
last_autoanalyze | 2021-09-21 11:54:36.924762+00
vacuum_count | 2
autovacuum_count | 1
analyze_count | 2
autoanalyze_count | 2
However, adding the AND data.channels=channels.channel to the query still makes it take around 5 minutes (https://explain.depesz.com/s/7hb1). So, again, running VACUUM didn’t appear to help any.
Also perhaps interestingly, if I again modify the query to only match on channel, not station, it is again fast (though not quite as fast): https://explain.depesz.com/s/HLb8
So, basically, I can quickly get a list of all channels for which I have data, or all stations for which I have data, but getting a list of all channels for each station is slow.
> I've heard of the loose indexscan before mentioned on this thread, but I'm not seeing how to implement that for multiple columns. Anyone have an example or pseudo code perhaps?
This is my stumbling block to implementing this option as well. That said, with the lookup table in place, perhaps all the rest of this is a moot point?
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145
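One possible answer to the multiple-column question, sketched under the assumption of the existing two-column index on data(station, channel), extends the wiki's single-column pattern with a LATERAL subquery that skips to the next distinct (station, channel) pair:

-- Loose index scan over (station, channel): each recursive step is
-- one index probe to the next distinct pair, so the work scales with
-- the number of distinct pairs (a few hundred), not the row count
-- (hundreds of millions).
WITH RECURSIVE pairs AS (
    -- Anchor: the smallest pair
    (SELECT station, channel FROM data ORDER BY station, channel LIMIT 1)
    UNION ALL
    SELECT n.station, n.channel
    FROM pairs p
    CROSS JOIN LATERAL (
        -- Row-value comparison lets the btree index find the next
        -- pair strictly beyond the current one in a single probe.
        SELECT d.station, d.channel
        FROM data d
        WHERE (d.station, d.channel) > (p.station, p.channel)
        ORDER BY d.station, d.channel
        LIMIT 1
    ) n
)
SELECT station, array_agg(channel) AS channels
FROM pairs
GROUP BY station;

The recursion terminates when the LATERAL subquery finds no pair beyond the current one, and the final GROUP BY aggregates the handful of surviving rows, which should give the original station/channels result in milliseconds rather than minutes.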