Home > mailing lists

Thread: Query optimization

Query optimization

From

Jorge Arevalo

Date:

29 October 2014, 01:03:47

Hello,

I'm trying to optimize a query that takes too much time. This is what I have

table1(field1, field2... field14): contains about 8.5 million rows

table2(f1, f2, f3): contains about 2.5 million rows

table3: is empty, and must be filled with data coming from table1 and table2

To fill table3, I'm using a query that looks like this:

WITH records_to_insert(field1,field2,field3,field4,field5,field6_hstore,field7,field8_array) AS

(SELECT value1,value2,value3,value4,value5, hstore(ARRAY['field9', 'field10', 'field11', 'field12', 'field13', 'field14'], ARRAY[field9, field10, field11, field12, field13, field14) as metadata, value7, (select array((select row(f1, f2) from table2 p where p.field7 = field7))) as values_array FROM table1)

SELECT fill_table3_function(field1,field2,field3,field4,field5,field6_hstore,field7,field8_array) FROM records_to_insert

So, I first generate a common table records_to_insert, using data from table1 and table2, and then call a function fill_table3_function, in order to insert the values into table3 (I do more things apart from insert, that's reason to call a function instead of just raising an insert query). There are indexes created on all the columns that need them.

I'm having problems because the query takes a lot of time, and the server returns a timeout error.

I think the bottleneck is the array built for the last argument of my common table, and maybe the hstore too (not sure if it's a heavy process or not). First thing I've tried is to limit the query to build the common table, using LIMIT .. OFFSET after 'FROM table1', and make a loop to seek table1 (the big one, with 8 million rows). But still getting timeout errors.

The problem is I don't have access to Postgres configuration, in order to increase the timeout for user queries. And anyway, I don't think that increasing the timeout is a real solution (It'll just make the server suffer for more time).

So, is there anything obviously wrong with my query? Any changes to make it faster?

Many thanks in advance, and best regards,

--
Jorge Arevalo
Freelance developer

http://about.me/jorgeas80

Re: Query optimization

From

David G Johnston

Date:

29 October 2014, 01:46:32

Jorge Arévalo-2 wrote
> So, I first generate a common table records_to_insert, using data from
> table1 and table2, and then call a function fill_table3_function, in order
> to insert the values into table3 (I do more things apart from insert,
> that's reason to call a function instead of just raising an insert query).
> There are indexes created on all the columns that need them.

I would suggest you figure out how to do what you need without pushing the
insert into the function.


> So, is there anything obviously wrong with my query? Any changes to make
> it
> faster?

You shouldn't expect much useful help when the important part of your query
is not provided.

Creating arrays and hstores is expensive but whether that is the biggest
factor is impossible to tell.

David J.




--
View this message in context: http://postgresql.1045698.n5.nabble.com/Query-optimization-tp5824739p5824744.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

Re: Query optimization

From

David G Johnston

Date:

29 October 2014, 01:54:28

Jorge Arévalo-2 wrote
> (SELECT value1,value2,value3,value4,value5, hstore(ARRAY['field9',
> 'field10', 'field11', 'field12', 'field13', 'field14'], ARRAY[field9,
> field10, field11, field12, field13, field14) as metadata, value7, (select
> array((select row(f1, f2) from table2 p where p.field7 = field7))) as
> values_array FROM table1)

You might try seeing whether:

FROM table1 JOIN (
SELECT field7, array_agg(row(f1, f2)) AS values_array FROM table2 GROUP BY
field7
) tbl2_agg USING (field7)

helps...

I'm also dubious (though this isn't necessarily a performance issue) of:

array[...] AS metadata

Without context I would say this would be better as a composite type instead
of an array.  You may find it useful to use named composite types elsewhere
too...

David J.




--
View this message in context: http://postgresql.1045698.n5.nabble.com/Query-optimization-tp5824739p5824746.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

Re: Query optimization

From

Jorge Arevalo

Date:

29 October 2014, 18:06:32

Hello David, many thanks for your responses,

Sorry for not providing the content of the fill_table3_function, but it just executes 3 insert queries in 3 different tables. And I've checked the time consuming operation is in this query (by the way, there was a little mistake in the name of the fields of the inner select, I've corrected it)

SELECT value1,value2,value3,value4,

value5, hstore(ARRAY['field9', 'field10', 'field11', 'field12', 'field13', 'field14'], ARRAY[field9, field10, field11, field12, field13, field14) as metadata, value7, (select array((select row(f1, f2) from table2 p where p.f3 = field7))) as values_array FROM table1

This is the result of EXPLAIN ANALYZE

                                                                   QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using table1_pkey on table1 (cost=67846.38..395773.45 rows=8419127 width=88) (actual time=7122.704..22670.680 rows=8419127 loops=1)
   InitPlan 2 (returns $1)
     -> Result (cost=67846.29..67846.29 rows=1 width=0) (actual time=7009.063..7009.065 rows=1 loops=1)
           InitPlan 1 (returns $0)
             -> Seq Scan on table2 p (cost=0.00..67846.29 rows=12689 width=20) (actual time=14.971..5069.840 rows=2537787 loops=1)
                   Filter: (f3 = field7)

So, there's a sequential scan over table2 (in the query to get values_array), instead of a index scan. Could it be because the SELECT returns more than approximately 5-10% of all rows in the table? (I've heard that, under those conditions, a sequential scan is faster than index scan, because the amount of I/O operations required for each row)

Anyway, if I understood well, I should try:

- Avoiding that inner query by using a JOIN instead

- Return a composite type instead of an array

Am I right? What kind of additional context information would you need?

Many thanks!!

Jorge

On Wed, Oct 29, 2014 at 2:54 AM, David G Johnston <david.g.johnston@gmail.com> wrote:

Jorge Arévalo-2 wrote
> (SELECT value1,value2,value3,value4,value5, hstore(ARRAY['field9',
> 'field10', 'field11', 'field12', 'field13', 'field14'], ARRAY[field9,
> field10, field11, field12, field13, field14) as metadata, value7, (select
> array((select row(f1, f2) from table2 p where p.field7 = field7))) as
> values_array FROM table1)

You might try seeing whether:

FROM table1 JOIN (
SELECT field7, array_agg(row(f1, f2)) AS values_array FROM table2 GROUP BY
field7
) tbl2_agg USING (field7)

helps...

I'm also dubious (though this isn't necessarily a performance issue) of:

array[...] AS metadata

Without context I would say this would be better as a composite type instead
of an array. You may find it useful to use named composite types elsewhere
too...

David J.

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Query-optimization-tp5824739p5824746.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

--
Jorge Arevalo
Freelance developer

http://about.me/jorgeas80

Re: Query optimization

From

David Johnston

Date:

29 October 2014, 18:24:29

List preference is to inline post or, at worse, bottom post. Please do not top post.

On Wed, Oct 29, 2014 at 11:06 AM, Jorge Arevalo <jorgearevalo@libregis.org> wrote:

Hello David, many thanks for your responses,

Sorry for not providing the content of the fill_table3_function, but it just executes 3 insert queries in 3 different tables. And I've checked the time consuming operation is in this query (by the way, there was a little mistake in the name of the fields of the inner select, I've corrected it)

SELECT value1,value2,value3,value4,
value5, hstore(ARRAY['field9', 'field10', 'field11', 'field12', 'field13', 'field14'], ARRAY[field9, field10, field11, field12, field13, field14) as metadata, value7, (select array((select row(f1, f2) from table2 p where p.f3 = field7))) as values_array FROM table1

This is the result of EXPLAIN ANALYZE

                                                                   QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using table1_pkey on table1 (cost=67846.38..395773.45 rows=8419127 width=88) (actual time=7122.704..22670.680 rows=8419127 loops=1)
   InitPlan 2 (returns $1)
     -> Result (cost=67846.29..67846.29 rows=1 width=0) (actual time=7009.063..7009.065 rows=1 loops=1)
           InitPlan 1 (returns $0)
             -> Seq Scan on table2 p (cost=0.00..67846.29 rows=12689 width=20) (actual time=14.971..5069.840 rows=2537787 loops=1)
                   Filter: (f3 = field7)

So, there's a sequential scan over table2 (in the query to get values_array), instead of a index scan. Could it be because the SELECT returns more than approximately 5-10% of all rows in the table? (I've heard that, under those conditions, a sequential scan is faster than index scan, because the amount of I/O operations required for each row)

What concerns me here is that the rows estimate on that table2 scan is ~13k while the actual count is ~ 2.5M; you need to run ANALYZE on both tables and see if your get similar results. Though given the need for sequential scan regardless (see next comment) the estimate miss likely doesn't affect actual performance or the plan that is chosen. But it is still worth looking into.

Anyway, if I understood well, I should try:

- Avoiding that inner query by using a JOIN instead

I don't know...looking at your explain (and some reasoning) it looks as if it is already doing that for you since there is only a single loop for the InitPlan 1. This is a little beyond my comfort zone but you've now provided a decent amount of information for others to speculate...though it would help to enable various timings as well and try and run the full query (with the function) in a development environment so that the entire routine can be evaluated.

- Return a composite type instead of an array

Worth looking into but impossible to recommend without knowing what your make believe fields are and are used for. More style than performance since I do not know the relative costs of building up an array and creating a composite.

Re: Query optimization

From

Tom Lane

Date:

29 October 2014, 18:53:26

Jorge Arevalo <jorgearevalo@libregis.org> writes:
> SELECT value1,value2,value3,value4,
> value5, hstore(ARRAY['field9', 'field10', 'field11', 'field12', 'field13',
> 'field14'], ARRAY[field9, field10, field11, field12, field13, field14) as
> metadata, value7, (select array((select row(f1, f2) from table2 p where
> p.f3 = field7))) as values_array FROM table1

The parentheses/brackets don't seem to match up real well here ...

> This is the result of EXPLAIN ANALYZE

>                                                                    QUERY
> PLAN
>
-------------------------------------------------------------------------------------------------------------------------------------------------
>  Index Scan using table1_pkey on table1  (cost=67846.38..395773.45
> rows=8419127 width=88) (actual time=7122.704..22670.680 rows=8419127
> loops=1)
>    InitPlan 2 (returns $1)
>      ->  Result  (cost=67846.29..67846.29 rows=1 width=0) (actual
> time=7009.063..7009.065 rows=1 loops=1)
>            InitPlan 1 (returns $0)
>              ->  Seq Scan on table2 p  (cost=0.00..67846.29 rows=12689
> width=20) (actual time=14.971..5069.840 rows=2537787 loops=1)
>                    Filter: (f3 = field7)

Hm.  If I'm reading that right, you're building an array containing
2537787 entries, each of which is a composite datum containing two
columns of unmentioned datatypes.  I suspect a big chunk of your
runtime is going into manipulating that array -- PG is not terribly
efficient with big arrays containing variable-width values.

I'm also a bit confused as to why the planner is saying that the (SELECT
ARRAY(...)) bit is an InitPlan and not a SubPlan.  That implies that
"field7" in the innermost WHERE clause is not a reference to table1 but a
reference to table2.  Is that really what you meant?  IOW, are you sure
this query is performing the right calculation in the first place?

            regards, tom lane

Re: Query optimization

From

David Johnston

Date:

29 October 2014, 19:05:47

On Wed, Oct 29, 2014 at 11:53 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Jorge Arevalo <jorgearevalo@libregis.org> writes:

> This is the result of EXPLAIN ANALYZE

> QUERY
> PLAN
> -------------------------------------------------------------------------------------------------------------------------------------------------
> Index Scan using table1_pkey on table1 (cost=67846.38..395773.45
> rows=8419127 width=88) (actual time=7122.704..22670.680 rows=8419127
> loops=1)
> InitPlan 2 (returns $1)
> -> Result (cost=67846.29..67846.29 rows=1 width=0) (actual
> time=7009.063..7009.065 rows=1 loops=1)
> InitPlan 1 (returns $0)
> -> Seq Scan on table2 p (cost=0.00..67846.29 rows=12689
> width=20) (actual time=14.971..5069.840 rows=2537787 loops=1)
> Filter: (f3 = field7)

Hm. If I'm reading that right, you're building an array containing
2537787 entries, each of which is a composite datum containing two
columns of unmentioned datatypes. I suspect a big chunk of your
runtime is going into manipulating that array -- PG is not terribly
efficient with big arrays containing variable-width values.

I'm also a bit confused as to why the planner is saying that the (SELECT
ARRAY(...)) bit is an InitPlan and not a SubPlan. That implies that
"field7" in the innermost WHERE clause is not a reference to table1 but a
reference to table2. Is that really what you meant? IOW, are you sure
this query is performing the right calculation in the first place?

I thought the InitPlan was in place because the planner choose to execute the correlated subquery as a standalone query since it realizes that it is going to have to end up processing the entire table anyway due to the lack of a filter on the outer query. In effect executing "table1 JOIN (table2 subquery) ON (f3 = field7)".

David J.

Re: Query optimization

From

Jorge Arevalo

Date:

29 October 2014, 19:14:26

On Wed, Oct 29, 2014 at 8:05 PM, David Johnston <david.g.johnston@gmail.com> wrote:

On Wed, Oct 29, 2014 at 11:53 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jorge Arevalo <jorgearevalo@libregis.org> writes:

> This is the result of EXPLAIN ANALYZE

> QUERY
> PLAN
> -------------------------------------------------------------------------------------------------------------------------------------------------
> Index Scan using table1_pkey on table1 (cost=67846.38..395773.45
> rows=8419127 width=88) (actual time=7122.704..22670.680 rows=8419127
> loops=1)
> InitPlan 2 (returns $1)
> -> Result (cost=67846.29..67846.29 rows=1 width=0) (actual
> time=7009.063..7009.065 rows=1 loops=1)
> InitPlan 1 (returns $0)
> -> Seq Scan on table2 p (cost=0.00..67846.29 rows=12689
> width=20) (actual time=14.971..5069.840 rows=2537787 loops=1)
> Filter: (f3 = field7)

Hm. If I'm reading that right, you're building an array containing
2537787 entries, each of which is a composite datum containing two
columns of unmentioned datatypes. I suspect a big chunk of your
runtime is going into manipulating that array -- PG is not terribly
efficient with big arrays containing variable-width values.

The seq scan over table2 is for finding entries in table2 (which contains 2537787) that matches a condition using a column from table1 (entries that match table1.field7 = table2.f3). But the array isn't going to contain all the entries, Just a few of them.

I think the time is being used in scanning table2 for all the rows of table1 (plus than 8 million).

I'm also a bit confused as to why the planner is saying that the (SELECT
ARRAY(...)) bit is an InitPlan and not a SubPlan. That implies that
"field7" in the innermost WHERE clause is not a reference to table1 but a
reference to table2. Is that really what you meant? IOW, are you sure
this query is performing the right calculation in the first place?

I thought the InitPlan was in place because the planner choose to execute the correlated subquery as a standalone query since it realizes that it is going to have to end up processing the entire table anyway due to the lack of a filter on the outer query. In effect executing "table1 JOIN (table2 subquery) ON (f3 = field7)".

David J.

Yes, for each row of table1, table2 is being scanned, to find all the entries that satisfy table1.field7 = table2.f3. Sounds that a really heavy task. I guess I should avoid it, right?

BTW, Tom, this is the query with all the parentheses/brackets

SELECT value1,value2,value3,value4, value5, hstore(ARRAY['field9', 'field10', 'field11', 'field12', 'field13', 'field14'], ARRAY[field9, field10, field11, field12, field13, field14]) as metadata, value7, (select array((select row(f1, f2) from table2 p where p.f3 = field7))) as values_array FROM table1

Oh, and sorry for the top posting!

--
Jorge Arevalo
Freelance developer

http://about.me/jorgeas80

Re: Query optimization

From

Tom Lane

Date:

29 October 2014, 19:39:55

Jorge Arevalo <jorgearevalo@libregis.org> writes:
>> On Wed, Oct 29, 2014 at 11:53 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I'm also a bit confused as to why the planner is saying that the (SELECT
>>> ARRAY(...)) bit is an InitPlan and not a SubPlan.  That implies that
>>> "field7" in the innermost WHERE clause is not a reference to table1 but a
>>> reference to table2.  Is that really what you meant?  IOW, are you sure
>>> this query is performing the right calculation in the first place?

> Yes, for each row of table1, table2 is being scanned, to find all the
> entries that satisfy table1.field7 = table2.f3.

You might think that's what it's doing, but the plan shape says otherwise.
An array of 2537787 entries is being formed *once* and then referenced at
each row of table1.  I'd take another look and see if "field7" isn't a
column name that exists in both tables.

            regards, tom lane

Re: Query optimization

From

David Johnston

Date:

29 October 2014, 19:48:01

On Wed, Oct 29, 2014 at 12:14 PM, Jorge Arevalo <jorgearevalo@libregis.org> wrote:

SELECT value1,value2,value3,value4, value5, hstore(ARRAY['field9', 'field10', 'field11', 'field12', 'field13', 'field14'], ARRAY[field9, field10, field11, field12, field13, field14]) as metadata, value7, (select array((select row(f1, f2) from table2 p where p.f3 = field7))) as values_array FROM table1

More generally, you really should table-prefix all column in correlated subqueries.

[...] from table2 p where p.f3 = table1.field7 [...]

I guess the InitPlan 1 you showed simply scanned table2 and applied the filter which then was fed to InitPlan 2 where the array is built; that array then is inserted into the outer query ~8M times...

David J.

Re: Query optimization

From

Jorge Arevalo

Date:

29 October 2014, 20:26:29

On Wed, Oct 29, 2014 at 8:47 PM, David Johnston <david.g.johnston@gmail.com> wrote:

On Wed, Oct 29, 2014 at 12:14 PM, Jorge Arevalo <jorgearevalo@libregis.org> wrote:

SELECT value1,value2,value3,value4, value5, hstore(ARRAY['field9', 'field10', 'field11', 'field12', 'field13', 'field14'], ARRAY[field9, field10, field11, field12, field13, field14]) as metadata, value7, (select array((select row(f1, f2) from table2 p where p.f3 = field7))) as values_array FROM table1

More generally, you really should table-prefix all column in correlated subqueries.

[...] from table2 p where p.f3 = table1.field7 [...]

I guess the InitPlan 1 you showed simply scanned table2 and applied the filter which then was fed to InitPlan 2 where the array is built; that array then is inserted into the outer query ~8M times...

David J.

Wow, you were right! There was a field with same name in both tables, and that caused problems. I've just prefixed each field with the table identifier, and now it works really fast.

Many thanks, guys!

--
Jorge Arevalo
Freelance developer

http://about.me/jorgeas80