Thread: Query optimization
Hello,
I'm trying to optimize a query that takes too much time. This is what I have:

table3 is empty, and must be filled with data coming from table1 and table2.
To fill table3, I'm using a query that looks like this:
WITH records_to_insert(field1, field2, field3, field4, field5, field6_hstore, field7, field8_array) AS (
    SELECT value1, value2, value3, value4, value5,
           hstore(ARRAY['field9', 'field10', 'field11', 'field12', 'field13', 'field14'],
                  ARRAY[field9, field10, field11, field12, field13, field14) as metadata,
           value7,
           (select array((select row(f1, f2) from table2 p where p.field7 = field7))) as values_array
    FROM table1
)
SELECT fill_table3_function(field1, field2, field3, field4, field5, field6_hstore, field7, field8_array)
FROM records_to_insert
So, I first generate a common table records_to_insert, using data from table1 and table2, and then call a function fill_table3_function, in order to insert the values into table3 (I do more things apart from the insert; that's the reason to call a function instead of just issuing an INSERT query). There are indexes created on all the columns that need them.
I'm having problems because the query takes a lot of time, and the server returns a timeout error.
I think the bottleneck is the array built for the last argument of my common table, and maybe the hstore too (not sure if that's a heavy process or not). The first thing I've tried is limiting the query that builds the common table, using LIMIT .. OFFSET after 'FROM table1', and looping over table1 (the big one, with 8 million rows) in chunks. But I'm still getting timeout errors.
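Each iteration of that loop looks roughly like this (simplified; the chunk size and the ordering column are just placeholders, my real loop lives in client code):

    -- One chunk of the loop; the client advances the offset by 10000 each pass.
    -- Note that OFFSET still walks past all the skipped rows, so later chunks
    -- get progressively slower on an 8-million-row table.
    WITH records_to_insert(field1, field2, field3, field4, field5,
                           field6_hstore, field7, field8_array) AS (
        SELECT ...            -- the same SELECT list as in the query above
        FROM table1
        ORDER BY id           -- placeholder primary-key column
        LIMIT 10000 OFFSET 0  -- OFFSET grows by 10000 per iteration
    )
    SELECT fill_table3_function(field1, field2, field3, field4, field5,
                                field6_hstore, field7, field8_array)
    FROM records_to_insert;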
The problem is I don't have access to the Postgres configuration, so I can't increase the timeout for user queries. And anyway, I don't think increasing the timeout is a real solution (it would just make the server suffer for longer).
So, is there anything obviously wrong with my query? Any changes to make it faster?
Many thanks in advance, and best regards,
Jorge Arévalo-2 wrote:
> So, I first generate a common table records_to_insert, using data from
> table1 and table2, and then call a function fill_table3_function, in order
> to insert the values into table3 (I do more things apart from insert,
> that's reason to call a function instead of just raising an insert query).
> There are indexes created on all the columns that need them.

I would suggest you figure out how to do what you need without pushing the
insert into the function.

> So, is there anything obviously wrong with my query? Any changes to make
> it faster?

You shouldn't expect much useful help when the important part of your query
is not provided.  Creating arrays and hstores is expensive, but whether that
is the biggest factor is impossible to tell.

David J.

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Query-optimization-tp5824739p5824744.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.
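For example, something along these lines (a sketch only -- the real table3 column list and the extra work done inside fill_table3_function weren't shown, so the names here are guesses):

    -- Hypothetical: do the insert as one set-based statement instead of
    -- calling a function once per row. table3's column names are assumed
    -- to match the CTE's output names.
    WITH records_to_insert(field1, field2, field3, field4, field5,
                           field6_hstore, field7, field8_array) AS (
        SELECT ...  -- same SELECT as in your CTE
        FROM table1
    )
    INSERT INTO table3 (field1, field2, field3, field4, field5,
                        field6_hstore, field7, field8_array)
    SELECT field1, field2, field3, field4, field5,
           field6_hstore, field7, field8_array
    FROM records_to_insert;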
Jorge Arévalo-2 wrote:
> (SELECT value1,value2,value3,value4,value5, hstore(ARRAY['field9',
> 'field10', 'field11', 'field12', 'field13', 'field14'], ARRAY[field9,
> field10, field11, field12, field13, field14) as metadata, value7, (select
> array((select row(f1, f2) from table2 p where p.field7 = field7))) as
> values_array FROM table1)

You might try seeing whether:

FROM table1 JOIN (
  SELECT field7, array_agg(row(f1, f2)) AS values_array
  FROM table2
  GROUP BY field7
) tbl2_agg USING (field7)

helps...

I'm also dubious (though this isn't necessarily a performance issue) of:

array[...] AS metadata

Without context I would say this would be better as a composite type instead
of an array.  You may find it useful to use named composite types elsewhere
too...

David J.

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Query-optimization-tp5824739p5824746.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.
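Spelled out against the full query, that rewrite might look like this (a sketch; the "t1" alias is added for clarity, and note a plain JOIN drops table1 rows with no match in table2 -- use LEFT JOIN to keep them):

    SELECT t1.value1, t1.value2, t1.value3, t1.value4, t1.value5,
           hstore(ARRAY['field9', 'field10', 'field11', 'field12', 'field13', 'field14'],
                  ARRAY[t1.field9, t1.field10, t1.field11, t1.field12,
                        t1.field13, t1.field14]) AS metadata,
           t1.value7,
           tbl2_agg.values_array
    FROM table1 t1
    JOIN (
        -- table2 is aggregated once, instead of being re-scanned per row
        SELECT field7, array_agg(row(f1, f2)) AS values_array
        FROM table2
        GROUP BY field7
    ) tbl2_agg USING (field7);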
Hello David, many thanks for your responses,
Sorry for not providing the content of the fill_table3_function, but it just executes 3 INSERT queries on 3 different tables. And I've checked that the time-consuming operation is this query (by the way, there was a little mistake in the name of the fields of the inner select; I've corrected it):

SELECT value1,value2,value3,value4,value5, hstore(ARRAY['field9', 'field10', 'field11', 'field12', 'field13', 'field14'], ARRAY[field9, field10, field11, field12, field13, field14) as metadata, value7, (select array((select row(f1, f2) from table2 p where p.f3 = field7))) as values_array FROM table1
This is the result of EXPLAIN ANALYZE
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using table1_pkey on table1 (cost=67846.38..395773.45 rows=8419127 width=88) (actual time=7122.704..22670.680 rows=8419127 loops=1)
InitPlan 2 (returns $1)
-> Result (cost=67846.29..67846.29 rows=1 width=0) (actual time=7009.063..7009.065 rows=1 loops=1)
InitPlan 1 (returns $0)
-> Seq Scan on table2 p (cost=0.00..67846.29 rows=12689 width=20) (actual time=14.971..5069.840 rows=2537787 loops=1)
Filter: (f3 = field7)
So, there's a sequential scan over table2 (in the query that builds values_array), instead of an index scan. Could it be because the SELECT returns more than roughly 5-10% of all rows in the table? (I've heard that, under those conditions, a sequential scan is faster than an index scan, because of the amount of I/O required per row.)
Anyway, if I understood well, I should try:
- Avoiding that inner query by using a JOIN instead
- Return a composite type instead of an array
Am I right? What kind of additional context information would you need?
Many thanks!!
--
Jorge
On Wed, Oct 29, 2014 at 2:54 AM, David G Johnston <david.g.johnston@gmail.com> wrote:
Jorge Arévalo-2 wrote
> (SELECT value1,value2,value3,value4,value5, hstore(ARRAY['field9',
> 'field10', 'field11', 'field12', 'field13', 'field14'], ARRAY[field9,
> field10, field11, field12, field13, field14) as metadata, value7, (select
> array((select row(f1, f2) from table2 p where p.field7 = field7))) as
> values_array FROM table1)
You might try seeing whether:
FROM table1 JOIN (
SELECT field7, array_agg(row(f1, f2)) AS values_array FROM table2 GROUP BY
field7
) tbl2_agg USING (field7)
helps...
I'm also dubious (though this isn't necessarily a performance issue) of:
array[...] AS metadata
Without context I would say this would be better as a composite type instead
of an array. You may find it useful to use named composite types elsewhere
too...
David J.
--
View this message in context: http://postgresql.1045698.n5.nabble.com/Query-optimization-tp5824739p5824746.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
--
Jorge Arevalo
Freelance developer
http://about.me/jorgeas80
List preference is to inline post or, at worse, bottom post. Please do not top post.
On Wed, Oct 29, 2014 at 11:06 AM, Jorge Arevalo <jorgearevalo@libregis.org> wrote:
Hello David, many thanks for your responses,

Sorry for not providing the content of the fill_table3_function, but it just executes 3 insert queries on 3 different tables. And I've checked that the time-consuming operation is this query (by the way, there was a little mistake in the name of the fields of the inner select; I've corrected it):

SELECT value1,value2,value3,value4,value5, hstore(ARRAY['field9', 'field10', 'field11', 'field12', 'field13', 'field14'], ARRAY[field9, field10, field11, field12, field13, field14) as metadata, value7, (select array((select row(f1, f2) from table2 p where p.f3 = field7))) as values_array FROM table1

This is the result of EXPLAIN ANALYZE
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using table1_pkey on table1 (cost=67846.38..395773.45 rows=8419127 width=88) (actual time=7122.704..22670.680 rows=8419127 loops=1)
InitPlan 2 (returns $1)
-> Result (cost=67846.29..67846.29 rows=1 width=0) (actual time=7009.063..7009.065 rows=1 loops=1)
InitPlan 1 (returns $0)
-> Seq Scan on table2 p (cost=0.00..67846.29 rows=12689 width=20) (actual time=14.971..5069.840 rows=2537787 loops=1)
Filter: (f3 = field7)

So, there's a sequential scan over table2 (in the query to get values_array), instead of an index scan. Could it be because the SELECT returns more than approximately 5-10% of all rows in the table? (I've heard that, under those conditions, a sequential scan is faster than an index scan, because of the amount of I/O operations required for each row)
What concerns me here is that the rows estimate on that table2 scan is ~13k while the actual count is ~2.5M; you need to run ANALYZE on both tables and see if you get similar results. Though given the need for a sequential scan regardless (see next comment), the estimate miss likely doesn't affect actual performance or the plan that is chosen. But it is still worth looking into.
Anyway, if I understood well, I should try:

- Avoiding that inner query by using a JOIN instead
I don't know... looking at your explain (and some reasoning), it looks as if it is already doing that for you, since there is only a single loop for InitPlan 1. This is a little beyond my comfort zone, but you've now provided a decent amount of information for others to speculate... though it would help to enable the various timings as well, and to run the full query (with the function) in a development environment so that the entire routine can be evaluated.
- Return a composite type instead of an array
Worth looking into, but impossible to recommend without knowing what your make-believe fields are and are used for. More style than performance, since I do not know the relative costs of building up an array versus creating a composite.
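To illustrate, a named composite type could look something like this (the type name and the field types are invented, since the real ones weren't given):

    -- Hypothetical named composite type replacing the six-key hstore.
    CREATE TYPE metadata_t AS (
        field9  text,
        field10 text,
        field11 text,
        field12 text,
        field13 text,
        field14 text
    );

    -- In the big SELECT, instead of hstore(ARRAY[...], ARRAY[...]):
    SELECT ROW(field9, field10, field11, field12, field13, field14)::metadata_t
           AS metadata
    FROM table1;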
Jorge Arevalo <jorgearevalo@libregis.org> writes:
> SELECT value1,value2,value3,value4,
> value5, hstore(ARRAY['field9', 'field10', 'field11', 'field12', 'field13',
> 'field14'], ARRAY[field9, field10, field11, field12, field13, field14) as
> metadata, value7, (select array((select row(f1, f2) from table2 p where
> p.f3 = field7))) as values_array FROM table1

The parentheses/brackets don't seem to match up real well here ...

> This is the result of EXPLAIN ANALYZE
> QUERY PLAN
> -------------------------------------------------------------------------------------------------------------------------------------------------
> Index Scan using table1_pkey on table1 (cost=67846.38..395773.45
> rows=8419127 width=88) (actual time=7122.704..22670.680 rows=8419127
> loops=1)
>   InitPlan 2 (returns $1)
>     -> Result (cost=67846.29..67846.29 rows=1 width=0) (actual
> time=7009.063..7009.065 rows=1 loops=1)
>       InitPlan 1 (returns $0)
>         -> Seq Scan on table2 p (cost=0.00..67846.29 rows=12689
> width=20) (actual time=14.971..5069.840 rows=2537787 loops=1)
>             Filter: (f3 = field7)

Hm.  If I'm reading that right, you're building an array containing
2537787 entries, each of which is a composite datum containing two
columns of unmentioned datatypes.  I suspect a big chunk of your
runtime is going into manipulating that array -- PG is not terribly
efficient with big arrays containing variable-width values.

I'm also a bit confused as to why the planner is saying that the (SELECT
ARRAY(...)) bit is an InitPlan and not a SubPlan.  That implies that
"field7" in the innermost WHERE clause is not a reference to table1 but a
reference to table2.  Is that really what you meant?  IOW, are you sure
this query is performing the right calculation in the first place?

			regards, tom lane
Jorge Arevalo <jorgearevalo@libregis.org> writes:
> This is the result of EXPLAIN ANALYZE
> QUERY
> PLAN
> -------------------------------------------------------------------------------------------------------------------------------------------------
> Index Scan using table1_pkey on table1 (cost=67846.38..395773.45
> rows=8419127 width=88) (actual time=7122.704..22670.680 rows=8419127
> loops=1)
> InitPlan 2 (returns $1)
> -> Result (cost=67846.29..67846.29 rows=1 width=0) (actual
> time=7009.063..7009.065 rows=1 loops=1)
> InitPlan 1 (returns $0)
> -> Seq Scan on table2 p (cost=0.00..67846.29 rows=12689
> width=20) (actual time=14.971..5069.840 rows=2537787 loops=1)
> Filter: (f3 = field7)
Hm. If I'm reading that right, you're building an array containing
2537787 entries, each of which is a composite datum containing two
columns of unmentioned datatypes. I suspect a big chunk of your
runtime is going into manipulating that array -- PG is not terribly
efficient with big arrays containing variable-width values.
I'm also a bit confused as to why the planner is saying that the (SELECT
ARRAY(...)) bit is an InitPlan and not a SubPlan. That implies that
"field7" in the innermost WHERE clause is not a reference to table1 but a
reference to table2. Is that really what you meant? IOW, are you sure
this query is performing the right calculation in the first place?
I thought the InitPlan was in place because the planner chose to execute the correlated subquery as a standalone query, since it realizes that it is going to have to end up processing the entire table anyway due to the lack of a filter on the outer query. In effect executing "table1 JOIN (table2 subquery) ON (f3 = field7)".
David J.
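The distinction being discussed can be sketched in two small variants (hypothetical, and they assume table2 also has a column named field7 -- which is exactly the trap being described):

    -- Correlated: "table1.field7" refers to the outer row, so the planner
    -- emits a SubPlan that is re-evaluated for every table1 row.
    SELECT ARRAY(SELECT row(p.f1, p.f2)
                 FROM table2 p
                 WHERE p.f3 = table1.field7) AS values_array
    FROM table1;

    -- Uncorrelated: if table2 itself has a field7 column, the unqualified
    -- "field7" binds to table2 (the inner scope wins), the subquery no longer
    -- depends on the outer row, and the planner emits an InitPlan: one big
    -- array, built once and reused for all ~8M outer rows.
    SELECT ARRAY(SELECT row(p.f1, p.f2)
                 FROM table2 p
                 WHERE p.f3 = field7) AS values_array
    FROM table1;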
On Wed, Oct 29, 2014 at 8:05 PM, David Johnston <david.g.johnston@gmail.com> wrote:
Jorge Arevalo <jorgearevalo@libregis.org> writes:
> This is the result of EXPLAIN ANALYZE
> QUERY
> PLAN
> -------------------------------------------------------------------------------------------------------------------------------------------------
> Index Scan using table1_pkey on table1 (cost=67846.38..395773.45
> rows=8419127 width=88) (actual time=7122.704..22670.680 rows=8419127
> loops=1)
> InitPlan 2 (returns $1)
> -> Result (cost=67846.29..67846.29 rows=1 width=0) (actual
> time=7009.063..7009.065 rows=1 loops=1)
> InitPlan 1 (returns $0)
> -> Seq Scan on table2 p (cost=0.00..67846.29 rows=12689
> width=20) (actual time=14.971..5069.840 rows=2537787 loops=1)
> Filter: (f3 = field7)
Hm. If I'm reading that right, you're building an array containing
2537787 entries, each of which is a composite datum containing two
columns of unmentioned datatypes. I suspect a big chunk of your
runtime is going into manipulating that array -- PG is not terribly
efficient with big arrays containing variable-width values.
The seq scan over table2 is for finding entries in table2 (which contains 2537787 rows) that match a condition using a column from table1 (entries that match table1.field7 = table2.f3). But the array isn't going to contain all the entries, just a few of them.
I think the time is being spent scanning table2 once for each of the rows of table1 (more than 8 million).
I'm also a bit confused as to why the planner is saying that the (SELECT
ARRAY(...)) bit is an InitPlan and not a SubPlan. That implies that
"field7" in the innermost WHERE clause is not a reference to table1 but a
reference to table2. Is that really what you meant? IOW, are you sure
this query is performing the right calculation in the first place?

I thought the InitPlan was in place because the planner chose to execute the correlated subquery as a standalone query, since it realizes that it is going to have to end up processing the entire table anyway due to the lack of a filter on the outer query. In effect executing "table1 JOIN (table2 subquery) ON (f3 = field7)".

David J.
Yes, for each row of table1, table2 is being scanned to find all the entries that satisfy table1.field7 = table2.f3. That sounds like a really heavy task. I guess I should avoid it, right?
BTW, Tom, this is the query with all the parentheses/brackets
SELECT value1,value2,value3,value4, value5, hstore(ARRAY['field9', 'field10', 'field11', 'field12', 'field13', 'field14'], ARRAY[field9, field10, field11, field12, field13, field14]) as metadata, value7, (select array((select row(f1, f2) from table2 p where p.f3 = field7))) as values_array FROM table1
Oh, and sorry for the top posting!
Jorge Arevalo <jorgearevalo@libregis.org> writes:
>> On Wed, Oct 29, 2014 at 11:53 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I'm also a bit confused as to why the planner is saying that the (SELECT
>>> ARRAY(...)) bit is an InitPlan and not a SubPlan.  That implies that
>>> "field7" in the innermost WHERE clause is not a reference to table1 but a
>>> reference to table2.  Is that really what you meant?  IOW, are you sure
>>> this query is performing the right calculation in the first place?

> Yes, for each row of table1, table2 is being scanned, to find all the
> entries that satisfy table1.field7 = table2.f3.

You might think that's what it's doing, but the plan shape says otherwise.
An array of 2537787 entries is being formed *once* and then referenced at
each row of table1.  I'd take another look and see if "field7" isn't a
column name that exists in both tables.

			regards, tom lane
SELECT value1,value2,value3,value4, value5, hstore(ARRAY['field9', 'field10', 'field11', 'field12', 'field13', 'field14'], ARRAY[field9, field10, field11, field12, field13, field14]) as metadata, value7, (select array((select row(f1, f2) from table2 p where p.f3 = field7))) as values_array FROM table1
More generally, you really should table-prefix all columns in correlated subqueries.
[...] from table2 p where p.f3 = table1.field7 [...]
I guess the InitPlan 1 you showed simply scanned table2 and applied the filter which then was fed to InitPlan 2 where the array is built; that array then is inserted into the outer query ~8M times...
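Applied to the query above, a fully qualified version would read something like this (a sketch; the "t1" alias is added here, and the nested (select array(...)) is written in the equivalent, slightly simpler ARRAY(...) form):

    SELECT t1.value1, t1.value2, t1.value3, t1.value4, t1.value5,
           hstore(ARRAY['field9', 'field10', 'field11', 'field12', 'field13', 'field14'],
                  ARRAY[t1.field9, t1.field10, t1.field11, t1.field12,
                        t1.field13, t1.field14]) AS metadata,
           t1.value7,
           -- every column in the subquery is qualified, so there is no
           -- ambiguity: p.f3 comes from table2, t1.field7 from table1
           ARRAY(SELECT row(p.f1, p.f2)
                 FROM table2 p
                 WHERE p.f3 = t1.field7) AS values_array
    FROM table1 t1;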
David J.
On Wed, Oct 29, 2014 at 8:47 PM, David Johnston <david.g.johnston@gmail.com> wrote:
Wow, you were right! There was a field with the same name in both tables, and that was causing the problem. I've just prefixed each field with its table identifier, and now it runs really fast.
Many thanks, guys!