Thread: BUG #12910: Memory leak with logical decoding
The following bug has been logged on the website:

Bug reference:      12910
Logged by:          Peter Slavov
Email address:      pet.slavov@gmail.com
PostgreSQL version: 9.4.1
Operating system:   Debian wheezy
Description:

Hi,
I am trying to use logical decoding to replay the data-modifying queries on a different server, to synchronize some of the tables. The primary server has some load, but not that much. I am getting the changes with pg_logical_slot_get_changes, limiting the changes to 50 at a time. At some point the pg_logical_slot_get_changes queries become slow and start to eat all the RAM and swap, which eventually kills the primary database with this error:
FATAL: the database system is in recovery mode.
The changes on the primary can contain a lot of data in one transaction, which I guess is the reason for the slow pg_logical_slot_get_changes.
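For reference, a minimal sketch of the polling approach described above, using the SQL interface of logical decoding; the slot name is a placeholder, and test_decoding is the plugin the reporter later says he used:

-- create a logical slot with the test_decoding output plugin (slot name illustrative)
SELECT * FROM pg_create_logical_replication_slot('my_slot', 'test_decoding');

-- consume and confirm at most ~50 changes per call, as described in the report
SELECT * FROM pg_logical_slot_get_changes('my_slot', NULL, 50);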
Hi,

On 2015-03-27 09:47:57 +0000, pet.slavov@gmail.com wrote:
> I am trying to use logical decoding to replay the data-modifying queries on a different server, to synchronize some of the tables. The primary server has some load, but not that much. I am getting the changes with pg_logical_slot_get_changes, limiting the changes to 50 at a time.

That's generally not a very good idea. It's much better to use the streaming interface. Repeatedly starting/stopping a logical slot (as the SQL interface has to do every time) isn't all that efficient.

> At some point the pg_logical_slot_get_changes queries become slow and start to eat all the RAM and swap, which eventually kills the primary database with this error:
> FATAL: the database system is in recovery mode.

That indicates a crash. Could you check the server log for any details? If it crashes, could you perhaps get a backtrace?

What output plugin are you using?

> The changes on the primary can contain a lot of data in one transaction, which I guess is the reason for the slow pg_logical_slot_get_changes.

In that case changes should just be spooled to disk.

Greetings,

Andres Freund
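As a concrete illustration of the streaming interface recommended here, pg_recvlogical can attach to an existing slot and stream changes continuously instead of repeatedly starting and stopping decoding; the database name, slot name and output file below are placeholders, and the slot is assumed to already exist:

pg_recvlogical -d mydb --slot my_slot --start -f /tmp/changes.sql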
On 31.03.2015 at 17:40, Andres Freund wrote:
> Hi,
>
> On 2015-03-27 09:47:57 +0000, pet.slavov@gmail.com wrote:
>> I am trying to use logical decoding to replay the data-modifying queries on a different server, to synchronize some of the tables. The primary server has some load, but not that much. I am getting the changes with pg_logical_slot_get_changes, limiting the changes to 50 at a time.
> That's generally not a very good idea. It's much better to use the streaming interface. Repeatedly starting/stopping a logical slot (as the SQL interface has to do every time) isn't all that efficient.

Actually I tried with pg_recvlogical and the result was the same.

>> At some point the pg_logical_slot_get_changes queries become slow and start to eat all the RAM and swap, which eventually kills the primary database with this error:
>> FATAL: the database system is in recovery mode.
> That indicates a crash. Could you check the server log for any details? If it crashes, could you perhaps get a backtrace?

I will try to get a backtrace, but from the logs I can see this:

Mar 30 10:59:57 peshka-dev postgres[20905]: [15930-1] 2015-03-30 10:59:57 GMT [20905]: [30924-1] user=postgres,db=sumup [local] WARNING: 57P02: terminating connection because of crash of another server process
Mar 30 10:59:57 peshka-dev postgres[20905]: [15930-2] 2015-03-30 10:59:57 GMT [20905]: [30925-1] user=postgres,db=sumup [local] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
Mar 30 10:59:57 peshka-dev postgres[20905]: [15930-3] 2015-03-30 10:59:57 GMT [20905]: [30926-1] user=postgres,db=sumup [local] HINT: In a moment you should be able to reconnect to the database and repeat your command.
Mar 30 10:59:57 peshka-dev postgres[20905]: [15930-4] 2015-03-30 10:59:57 GMT [20905]: [30927-1] user=postgres,db=sumup [local] CONTEXT: slot "peshka", output plugin "test_decoding", in the change callback, associated LSN CC/833C4940
Mar 30 10:59:57 peshka-dev postgres[20905]: [15930-5] 2015-03-30 10:59:57 GMT [20905]: [30928-1] user=postgres,db=sumup [local] LOCATION: quickdie, postgres.c:2581
Mar 30 10:59:57 peshka-dev postgres[20905]: [15930-6] 2015-03-30 10:59:57 GMT [20905]: [30929-1] user=postgres,db=sumup [local] STATEMENT: select * from pg_logical_slot_get_changes('peshka', null, 1);

This happens, of course, just 1-2 minutes after all the RAM and swap has been eaten by PostgreSQL.

> What output plugin are you using?

I tried with test_decoding and decoder_raw from https://github.com/michaelpq/pg_plugins. The result is the same.

>> The changes on the primary can contain a lot of data in one transaction, which I guess is the reason for the slow pg_logical_slot_get_changes.
> In that case changes should just be spooled to disk.
>
> Greetings,
>
> Andres Freund
Hi,

I tried to do a backtrace, but I have nothing useful there. Only lines like this:

Program received signal SIGUSR1, User defined signal 1.
Detaching after fork from child process 13074.
Detaching after fork from child process 13077.
Detaching after fork from child process 13083.
Detaching after fork from child process 13087.
Detaching after fork from child process 13093.
Detaching after fork from child process 13100.
Detaching after fork from child process 13105.
Detaching after fork from child process 13109.
Detaching after fork from child process 13115.
Detaching after fork from child process 13119.

And the process fails again with:

WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
CONTEXT: slot "peshka", output plugin "test_decoding", in the change callback, associated LSN D0/F66B52C8
server closed the connection unexpectedly
	This probably means the server terminated abnormally before or while processing the request.
connection to server was lost

Peter Slavov
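A side note on the debugging attempt, not from the thread: the "Detaching after fork" lines suggest gdb was attached to the postmaster rather than to the backend running the decoding query, and SIGUSR1 is used routinely by PostgreSQL, so gdb should be told to pass it through rather than stop on it. A possible session (the backend PID, taken for example from pg_stat_activity, is a placeholder) might look like:

gdb -p <backend_pid>
(gdb) handle SIGUSR1 nostop noprint pass
(gdb) continue
# when the backend aborts:
(gdb) bt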
Hi,

I'm on holidays right now, so my answers will be delayed.

On 2015-04-06 15:35:19 +0300, Peter Slavov wrote:
> Before I start I can see that the pg_xlog directory is 7.2 GB in size. This gives me some idea that the size of the changes cannot be much bigger than that.

There's no such easy correlation. That said, there pretty much never should be a case where so much memory is needed.

> After I start to get the transaction changes one by one with select * from pg_logical_slot_get_changes('<slot name>', null, 1),

As I said before, it's *not* a good idea to consume transactions one-by-one. The startup of the decoding machinery is quite expensive. If you want more control over how much data you get, you should use the streaming interface.

> Maybe I am not understanding something, but is this normal?

It's definitely not normal. It's unfortunately also hard to diagnose based on the information so far. Any chance you can build a reproducible test case?

Greetings,

Andres Freund
Hi Andres,

I am writing again because I did some more testing to see what resources were used. Now I am more sure that this is a memory leak rather than anything else. I extended the swap and the disk space on the server I was testing on.

Before I start I can see that the pg_xlog directory is 7.2 GB in size. This gives me some idea that the size of the changes cannot be much bigger than that. After I start to get the transaction changes one by one with select * from pg_logical_slot_get_changes('<slot name>', null, 1), on one of these statements the process eats all the RAM and some space on the disk and fails again. These are the consumed resources:
- 18 GB RAM + swap - 0 free
- 12 GB in base/pgsql_tmp in the data directory

Again, this is compared with the 7.2 GB in the pg_xlog directory ...
Maybe I am not understanding something, but is this normal? To my knowledge the data in pg_xlog is not compressed, so how can the size just increase like this?

Best Regards,
Peter Slavov
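As an aside, one way to quantify how much WAL the slot is actually holding back (and therefore roughly how much has to be decoded) is to compare the slot's restart_lsn with the current WAL position; this is a generic 9.4 query, not taken from the report:

SELECT slot_name,
       restart_lsn,
       pg_size_pretty(pg_xlog_location_diff(pg_current_xlog_location(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;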
Hi Andres,

Thanks for the answer, and happy holidays.

On 6.04.2015 at 16:50, Andres Freund wrote:
> Hi,
>
> I'm on holidays right now, so my answers will be delayed.
>
> On 2015-04-06 15:35:19 +0300, Peter Slavov wrote:
>> Before I start I can see that the pg_xlog directory is 7.2 GB in size. This gives me some idea that the size of the changes cannot be much bigger than that.
> There's no such easy correlation. That said, there pretty much never should be a case where so much memory is needed.
>
>> After I start to get the transaction changes one by one with select * from pg_logical_slot_get_changes('<slot name>', null, 1),
> As I said before, it's *not* a good idea to consume transactions one-by-one. The startup of the decoding machinery is quite expensive. If you want more control over how much data you get, you should use the streaming interface.

I tried with more transactions, and without a limit there, but there is no difference. I actually tried before with pg_recvlogical, which does use the streaming interface, but the result was exactly the same. My guess is that the problem is actually in the plug-in.

>> Maybe I am not understanding something, but is this normal?
> It's definitely not normal. It's unfortunately also hard to diagnose based on the information so far. Any chance you can build a reproducible test case?

Yes, I will try to make a reproducible test case.

> Greetings,
>
> Andres Freund

Thanks,
Peter Slavov
Hi Andres,

I prepared a test case that can reproduce the problem. This is the link to the SQL file: https://my.pcloud.com/publink/show?code=XZHuLXZomc1T6UJy3XUUJY19wHPejNjNjJk

I am just creating a few tables and filling them with data, then deleting the data and then filling them again. This will produce ~1.5 GB tables two times and delete them once. In the logical decoding output these queries will of course produce one row for each inserted and deleted row, which amounts to ~15 GB of data for one transaction.

I get the data from the terminal like this:
psql test_case -c "select * from pg_logical_slot_get_changes('testing', null, 1);" 1>/tmp/replay.sql

You can see that when the big transaction is read from the WAL files, all of this 15 GB goes into RAM and swap space. Eventually, if you have enough RAM and swap, the data will be assembled and then written to the file.
I know that this setup is not optimal, but it can reproduce the big memory usage.

Greetings,
Peter Slavov
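The linked SQL file is not reproduced here; a minimal sketch of the kind of workload it describes (bulk insert, delete, re-insert, each as one large transaction) could look like the following, with table name and row counts purely illustrative:

CREATE TABLE bulk_test (id serial PRIMARY KEY, payload text);
-- each statement is one large transaction producing millions of decoded rows
INSERT INTO bulk_test (payload) SELECT repeat('x', 100) FROM generate_series(1, 5000000);
DELETE FROM bulk_test;
INSERT INTO bulk_test (payload) SELECT repeat('x', 100) FROM generate_series(1, 5000000);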
Hi,

On 2015-04-09 18:11:27 +0300, Peter Slavov wrote:
> I prepared a test case that can reproduce the problem.

Yup. I can reproduce it... I did not (yet) have the time to run the test to completion, but I believe the attached patch should fix the problem (and also improve performance a bit...).

Using the SQL interface for such large transactions isn't going to be fun, as all of the data, due to the nature of the set-returning function implementation in postgres, will additionally be written into a tuplestore. The streaming interface doesn't have that behaviour.

Additionally, it's probably not a good idea to stream such a large result set via SELECT using psql - IIRC it'll try to store all that data in memory :). Try something like
\copy (select * from pg_logical_slot_peek_changes('testing', null, 1)) TO /tmp/f
or such.

Greetings,

Andres Freund
Attachment
Hi,

On 2015-04-16 15:53:40 +0300, Peter Slavov wrote:
> Can you tell me at which point, after testing, this can be included in some branch and later in some official PostgreSQL version?

I'd hoped you'd confirm that it fixes your 'real' problem ;)

Other than that I plan to commit it to 9.4 and master. It'll then be included in the next minor (and obviously later major) releases.

Greetings,

Andres Freund
Hi,

Can you tell me at which point, after testing, this can be included in some branch and later in some official PostgreSQL version?

Regards,
Peter Slavov
On 2015-04-22 18:49:18 +0300, Peter Slavov wrote:
> Sorry for the late answer - I got mixed up with the wrong Postgres version and wasted time testing code that is not patched. I did the testing after that on the patched version, and basically I don't see a difference when I use a simple SQL statement like before. Psql is putting everything in RAM/swap before dumping it out (the size is again ~15-16 GB - no change there).

That's a different problem though. Generally psql will, unless you configure it to use cursors, pull the whole query result into memory.

I thought earlier it wasn't just psql that needed all that memory?

> I tried with copy - much better memory footprint of course.

COPY *did* show the problem for me without the patch.

> I guess I will have to forget about using the SQL interface. I will try using pg_recvlogical or some other way to connect my python script to the slot using the streaming protocol .. thanks

That's a good way independently.

Greetings,

Andres Freund
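For completeness, psql can be made to fetch results through a cursor in batches instead of buffering the whole result set in client memory; FETCH_COUNT is a standard psql variable, and the batch size below is illustrative:

\set FETCH_COUNT 1000
select * from pg_logical_slot_get_changes('testing', null, null);

The server still materializes the whole function result in a tuplestore, as noted earlier in the thread; FETCH_COUNT only keeps psql itself from holding the full result in client memory.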
On 2015-04-22 19:02:51 +0300, Peter Slavov wrote:
> Are all the connectors based on libpq acting the same way as psql?

No. a) you can use cursors, b) you can use libpq's row mode, c) if you're referring to logical decoding specifically, you can use the streaming protocol.

Greetings,

Andres Freund
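As a sketch of option (a) for libpq-based connectors that do not expose a row mode, the call can be wrapped in an explicit cursor and fetched in chunks; the cursor name and batch size are illustrative, and the server-side tuplestore caveat from earlier in the thread still applies:

BEGIN;
DECLARE changes NO SCROLL CURSOR FOR
    select * from pg_logical_slot_get_changes('testing', null, null);
FETCH FORWARD 1000 FROM changes;
-- repeat FETCH until it returns no rows
CLOSE changes;
COMMIT;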
Hi,

Sorry for the late answer - I got mixed up with the wrong Postgres version and wasted time testing code that is not patched.
I did the testing after that on the patched version, and basically I don't see a difference when I use a simple SQL statement like before. Psql is putting everything in RAM/swap before dumping it out (the size is again ~15-16 GB - no change there). I tried with copy - much better memory footprint of course.
I guess I will have to forget about using the SQL interface. I will try using pg_recvlogical or some other way to connect my python script to the slot using the streaming protocol .. thanks

Peter
<p dir="ltr">Hi,<p dir="ltr">Are all the connectors based on libpq acting the same like psql?<p dir="ltr">Peter<div class="gmail_quote">На22.04.2015 18:57 "Andres Freund" <<a href="mailto:andres@anarazel.de">andres@anarazel.de</a>>написа:<br type="attribution" /><blockquote class="gmail_quote"style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On 2015-04-22 18:49:18 +0300, PeterSlavov wrote:<br /> > Sorry for the late answer - I got mixed up with the wrong Postgres version<br /> > and wastetime testing on code that is not pached.<br /> > I did testing after that on the patched version and basically Idon't see<br /> > difference when I use simple sql statement like before. Psql is putting all<br /> > in RAM/swapbefore dumping it out ( the size is again ~15-16 GB - no change<br /> > there).<br /><br /> That's a differentproblem though. Generally psql will, unless you<br /> configure it to use cursors, pull in the whole query in memory.<br/><br /> I thought earlier it wasn't just psql that needed all that memory?<br /><br /> > I tried with copy- much better memory footprint of course.<br /><br /> COPY *did* show the problem for me without the patch.<br /><br/> > I guess I will have to forget using the SQL interface. I will try using<br /> > pg_recvlogical or some otherway to connect my python script to the slot<br /> > using the streaming protocol .. thanks<br /><br /> That's a goodway independently.<br /><br /> Greetings,<br /><br /> Andres Freund<br /></blockquote></div>