Thread: Very slow UPDATE statement on 40 million rows
Hi!
I'm new to this mailing list and new to PostgreSQL as well. This concerns our own backup software (Java); we switched the database from MySQL to PostgreSQL and need some help.
The backup database records every file on the server. On my testing platform there are about 40 million rows, and the database has grown to about 70 GB (the same database was about 11 GB with MySQL, but that is another issue).
Before each backup job, a reset is performed to bring all files into a consistent state:
UPDATE BackupFiles
   SET nTapeNr=0, nAFIOCounter=0, nBlockCounter=0,
       cStatus='NEW'::StatusT, bOnSetBlue=false,
       bOnSetYellow=false, nLastBackupTS='0001-01-01 00:00:00'
 WHERE cStatus='NEW' OR cStatus='WRITING' OR cStatus='ONTAPE';
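(For readers unfamiliar with StatusT: it is a custom enum type, hence the explicit cast in the SET clause. Its definition is not shown anywhere in this thread; a plausible sketch, assuming only the three status values that appear above, would be:

CREATE TYPE StatusT AS ENUM ('NEW', 'WRITING', 'ONTAPE');  -- assumed value list; the real type may have more values

)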
Explain analyze: http://explain.depesz.com/s/8y5
The statement takes 60-90 minutes. In MySQL the same statement takes about 2 minutes on the same HW.
I have tried to optimize the settings, but so far without success.
Can we optimize this update statement somehow? Do you have any other ideas?
Any help is appreciated! Thank you!
My current config:
shared_buffers = 2GB
work_mem = 16MB
wal_buffers = 16MB
checkpoint_segments = 45
random_page_cost = 2.0
effective_cache_size = 6GB
HW:
2x Intel E5405 @ 2.00GHz
8 GB RAM
3ware 9650SE-16ML RAID Controller, all caches enabled
DB is on a RAID6 with 14x 1TB (since the DB partition on the RAID1 was too small)
SW:
Debian Squeeze 6.0 with Kernel 3.5.4
Postgres 8.4.13 (standard Debian package)
Table:
+---------------+-----------------------------+---------------------------------------------------------------------+
| Column | Type | Modifiers |
+---------------+-----------------------------+---------------------------------------------------------------------+
| _rowid | bigint | not null default nextval('backupfiles__rowid_seq'::regclass) |
| cfilename | bytea | not null |
| nfilesize | bigint | not null default 0::bigint |
| nfilectimets | timestamp without time zone | not null default '1970-01-01 00:00:00'::timestamp without time zone |
| ntapenr | integer | not null default 0 |
| nafiocounter | bigint | not null default 0::bigint |
| nblockcounter | bigint | not null default 0::bigint |
| cstatus | statust | not null default 'NEW'::statust |
| bonsetblue | boolean | default false |
| bonsetyellow | boolean | default false |
| nlastbackupts | timestamp without time zone | not null default '1970-01-01 00:00:00'::timestamp without time zone |
+---------------+-----------------------------+---------------------------------------------------------------------+
Indexes:
"backupfiles_pkey" PRIMARY KEY, btree (_rowid)
"cfilename_index" btree (cfilename)
"cstatus_index" btree (cstatus)
"nfilectimets_index" btree (nfilectimets)
"ntapenr_index" btree (ntapenr)
Example row:
+--------+-----------------------------+-----------+---------------------+---------+--------------+---------------+---------+------------+--------------+---------------------+
| _rowid | cfilename | nfilesize | nfilectimets | ntapenr | nafiocounter | nblockcounter | cstatus | bonsetblue | bonsetyellow | nlastbackupts |
+--------+-----------------------------+-----------+---------------------+---------+--------------+---------------+---------+------------+--------------+---------------------+
| 1 | /dicom/log/datadir_init.log | 1790 | 2013-01-30 14:02:48 | 0 | 0 | 0 | NEW | f | f | 0001-01-01 00:00:00 |
+--------+-----------------------------+-----------+---------------------+---------+--------------+---------------+---------+------------+--------------+---------------------+
--
Best regards,
Florian Schröck
aycan Digitalsysteme GmbH, Würzburg
Florian Schröck <fschroeck@aycan.de> wrote:

> UPDATE BackupFiles
>    SET nTapeNr=0, nAFIOCounter=0, nBlockCounter=0,
>        cStatus='NEW'::StatusT, bOnSetBlue=false,
>        bOnSetYellow=false, nLastBackupTS='0001-01-01 00:00:00'
>  WHERE cStatus='NEW' OR cStatus='WRITING' OR cStatus='ONTAPE';
>
> Explain analyze: http://explain.depesz.com/s/8y5
> The statement takes 60-90 minutes.

The EXPLAIN ANALYZE at that URL shows a runtime of 3 minutes and 41 seconds.

> I tried to optimize the settings but until now without success.
>
> Can we optimize this update statement somehow? Do you have any
> other ideas?

Are there any rows which would already have the values that you are setting? If so, it would be faster to skip those by using this query:

UPDATE BackupFiles
   SET nTapeNr=0, nAFIOCounter=0, nBlockCounter=0,
       cStatus='NEW'::StatusT, bOnSetBlue=false,
       bOnSetYellow=false, nLastBackupTS='0001-01-01 00:00:00'
 WHERE (cStatus='NEW' OR cStatus='WRITING' OR cStatus='ONTAPE')
   AND (nTapeNr <> 0 OR nAFIOCounter <> 0 OR nBlockCounter <> 0
        OR cStatus <> 'NEW'::StatusT
        OR bOnSetBlue IS DISTINCT FROM false
        OR bOnSetYellow IS DISTINCT FROM false
        OR nLastBackupTS <> '0001-01-01 00:00:00');

Another way to accomplish this is with the suppress_redundant_updates_trigger() trigger function:

http://www.postgresql.org/docs/9.2/interactive/functions-trigger.html

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
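[For reference, wiring up that trigger would look roughly like this, a minimal sketch following the linked documentation; the trigger name is arbitrary, and the docs suggest a name that sorts last so it fires after other per-row triggers:

CREATE TRIGGER z_min_update
BEFORE UPDATE ON BackupFiles
FOR EACH ROW
EXECUTE PROCEDURE suppress_redundant_updates_trigger();

With this in place, an UPDATE silently skips the heap write for any row whose new values are identical to the old ones, at the cost of invoking the trigger for every matched row.]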
Hello Kevin,

Not updating the rows that don't need the update solved the problem! Your query took only 1 minute. :)

Thank you so much for the fast response, have a great weekend!

PS: When you switch to "TEXT" on the explain URL you can see the final runtime, which was 66 minutes with the original statement.

Best regards,
Florian

On 02/15/2013 03:59 PM, Kevin Grittner wrote:
> [snip]
On Fri, Feb 15, 2013 at 9:32 AM, Florian Schröck <fschroeck@aycan.de> wrote:
> Hello Kevin,
> not updating the rows that don't need the update solved the problem!
> Your query took only 1 minute. :)
> [snip]

If the number of rows you actually update is not very large relative to the size of the table, just for fun, try this:

CREATE OR REPLACE FUNCTION BackupFilesCandidateReset(BackupFiles)
RETURNS BOOL AS
$$
  SELECT ($1).cStatus IN ('NEW', 'WRITING', 'ONTAPE')
    AND (($1).nTapeNr, ($1).nAFIOCounter, ($1).nBlockCounter,
         ($1).cStatus, ($1).bOnSetBlue, ($1).bOnSetYellow,
         ($1).nLastBackupTS)
        IS DISTINCT FROM /* simple != will suffice if values are never null */
        (0, 0, 0, 'NEW'::StatusT, false, false, '0001-01-01 00:00:00');
$$ LANGUAGE SQL IMMUTABLE;

CREATE INDEX ON BackupFiles(BackupFilesCandidateReset(BackupFiles))
  WHERE BackupFilesCandidateReset(BackupFiles);

SELECT * FROM BackupFiles WHERE BackupFilesCandidateReset(BackupFiles);

UPDATE BackupFiles SET ... WHERE BackupFilesCandidateReset(BackupFiles);

etc.

The idea here is to maintain a partial boolean index representing the candidate records to update. Plus, it's nifty: this is the basic mechanism that can be used as the foundation for very fast push/pull queues.

merlin
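[To see whether the planner actually picks up such a partial index, one could run something like the following, a sketch assuming the function and index above have been created and the table has been ANALYZEd:

EXPLAIN
SELECT count(*)
  FROM BackupFiles
 WHERE BackupFilesCandidateReset(BackupFiles);

An Index Scan or Bitmap Index Scan on the partial index, rather than a Seq Scan, confirms that the rewritten SELECT and UPDATE will touch only candidate rows.]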
Hello Merlin,
Thanks for the feedback. I forwarded this to my developer; it's an interesting approach.
On 02/19/2013 09:04 PM, Merlin Moncure wrote:
> [snip]
--
Best regards,
Florian Schröck