Thread: Performance problems with DISTINCT ON
I need to retrieve the most recent price per product from a price list table:
CREATE TABLE listini_anagrafici
(
id character varying(36) NOT NULL,
articolo character varying(18),
listino character varying(5),
data_ent date,
data_fin date,
prezzo double precision,
ultimo boolean DEFAULT false,
date_entered timestamp without time zone NOT NULL,
date_modified timestamp without time zone NOT NULL,
created_by character varying(36),
modified_user_id character varying(36) NOT NULL,
deleted boolean NOT NULL DEFAULT false,
CONSTRAINT listini_anagrafici_id_key UNIQUE (id)
)
I guess the right query is:
select distinct on (articolo) articolo,data_ent,prezzo from listini_anagrafici order by articolo, data_ent desc
but it seems that this query runs slowly... about 5-6 seconds.
The table contains more or less 500K records, the PostgreSQL version is 8.1.11, and the server has 4 GB of RAM entirely dedicated to the db.
I've tried adding this index:
CREATE INDEX articolo_data_ent ON listini_anagrafici (articolo, data_ent)
but it doesn't help.
As you can see from the EXPLAIN ANALYZE output (below), the query seems to ignore the index:
Unique  (cost=73893.89..76551.25 rows=88312 width=24) (actual time=4022.578..5076.206 rows=193820 loops=1)
  ->  Sort  (cost=73893.89..75222.57 rows=531472 width=24) (actual time=4022.574..4505.538 rows=531472 loops=1)
        Sort Key: articolo, data_ent
        ->  Seq Scan on listini_anagrafici  (cost=0.00..16603.72 rows=531472 width=24) (actual time=0.009..671.797 rows=531472 loops=1)
Total runtime: 5217.452 ms
Does anyone know how to make this query run faster?
Sgarbossa Domenico wrote:
> I need to retrieve the most recent price per product from a price list table:
>
> select distinct on (articolo) articolo,data_ent,prezzo from listini_anagrafici order by articolo, data_ent desc
>
> but it seems that this query runs slowly... about 5-6 seconds.
> the table contains more or less 500K records, PostgreSQL version is 8.1.11
> and the server has 4 GB of RAM entirely dedicated to the db.
> [...]

You've got 531472 rows in the table and the query is going to output
193820 of them. Scanning the whole table is almost certainly the way to go.

If the table doesn't change much, you could try running a CLUSTER on the
index you've created. That will lock the table while it re-orders the
physical layout of the rows based on your index, though, so it's no good
if the table is updated much.

Failing that, you could try issuing "set work_mem = ..." before the
query with increasing sizes for work_mem. That might make the sort
faster too.

--
Richard Huxton
Archonet Ltd
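Concretely, on 8.1 those two suggestions would look something like the sketch below, using the articolo_data_ent index created above. The work_mem figure is only an illustrative value, not a tuned recommendation:

    -- 8.1 reads work_mem as kilobytes (unit strings like '128MB'
    -- arrived later), so this asks for ~128 MB of sort memory for
    -- the current session only.
    SET work_mem = 131072;

    -- Pre-8.3 CLUSTER syntax: index name first, then the table.
    -- Rewrites the table in index order and holds an exclusive
    -- lock while it runs.
    CLUSTER articolo_data_ent ON listini_anagrafici;
    ANALYZE listini_anagrafici;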
Subject: Re: [PERFORM] Performance problems with DISTINCT ON

> You've got 531472 rows in the table and the query is going to output
> 193820 of them. Scanning the whole table is almost certainly the way to go.
>
> If the table doesn't change much, you could try running a CLUSTER on the
> index you've created. [...]
>
> Failing that, you could try issuing "set work_mem = ..." before the
> query with increasing sizes for work_mem. That might make the sort
> faster too.

Thank you for the answer.
I've tried as you suggested, but the only thing that seems to make a
difference is the work_mem parameter. It cuts the runtime roughly in
half (about 3 seconds), but unfortunately that isn't enough: with a lot
of concurrent requests I think it could make the data swap to disk.
Should I try a different approach to solve this issue?
> Should I try a different approach to solve this issue?

Yes. Ask yourself if you *really* need 180k rows.

Best regards,

depesz

--
Linkedin: http://www.linkedin.com/in/depesz / blog: http://www.depesz.com/
jid/gtalk: depesz@depesz.com / aim:depeszhdl / skype:depesz_hdl / gg:6749007
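If the application really only needs the latest price for one product at a time, a narrow lookup can serve each request from the (articolo, data_ent) index instead of materialising all 193K rows. The product code below is a made-up example value:

    SELECT articolo, data_ent, prezzo
    FROM listini_anagrafici
    WHERE articolo = 'ABC123'   -- hypothetical product code
    ORDER BY data_ent DESC      -- backward scan of (articolo, data_ent)
    LIMIT 1;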
DISTINCT ON in Postgres 8.1 forces a sort. It may be faster if you
restructure the query to use GROUP BY + ORDER BY. But that might not
help either, since the data might not be large enough for a plan that
hash-aggregates and then sorts the result to be faster.

An index on (articolo, data_ent desc) might help -- but only if the
planner thinks that the index scan is faster. You may have to tweak the
cost parameter for random I/O downward to get it to choose a plan that
uses that index -- which will be faster if the index and data are in
memory, but slower if it has to go to disk too often.

If this query is being run a lot, and concurrently, it sounds like the
application needs some tweaks. The result might be cacheable by the
application for short intervals of time, for example. Or, if only small
bits of the table are updated, a timestamp column and a filter that
selects only the parts updated since the last run can let a client
application merge the updates with a previous full result client-side.

On 9/29/09 5:44 AM, "Sgarbossa Domenico" <domenico.sgarbossa@eniac.it> wrote:
> Thank you for the answer.
> I've tried as you suggested, but the only thing that seems to make a
> difference is the work_mem parameter. [...]
> Should I try a different approach to solve this issue?
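A sketch of the GROUP BY restructuring described above. It assumes at most one row per (articolo, data_ent) pair; where ties exist it returns all of them, whereas DISTINCT ON picks one arbitrarily. Note also that DESC columns in CREATE INDEX only arrived in 8.3, so on 8.1 the existing ascending index plus a backward scan is the closest equivalent:

    SELECT l.articolo, l.data_ent, l.prezzo
    FROM listini_anagrafici l
    JOIN (SELECT articolo, max(data_ent) AS data_ent
          FROM listini_anagrafici
          GROUP BY articolo) latest
      USING (articolo, data_ent);

    -- Lowering the random I/O cost estimate nudges the planner toward
    -- index scans; 2.0 is an illustrative value for a mostly-cached DB
    -- (the 8.1 default is 4.0), not a tuned recommendation.
    SET random_page_cost = 2.0;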