Re: Sync Rep v17 - Mailing list pgsql-hackers
From | Yeb Havinga |
---|---|
Subject | Re: Sync Rep v17 |
Date | |
Msg-id | 4D6B6B8D.8040902@gmail.com |
In response to | Re: Sync Rep v17 (Jaime Casanova <jaime@2ndquadrant.com>) |
Responses | Re: Sync Rep v17 |
List | pgsql-hackers |
On 2011-02-25 20:40, Jaime Casanova wrote:
> On Fri, Feb 25, 2011 at 10:41 AM, Yeb Havinga<yebhavinga@gmail.com> wrote:
>> I also did some initial testing on this patch and got the queue related
>> errors with> 1 clients. With the code change from Jaime above I still got a
>> lot of 'not on queue warnings'.
>>
>> I tried to understand how the queue was supposed to work - resulting in the
>> changes below that also incorporates a suggestion from Fujii upthread, to
>> early exit when myproc was found
> yes, looking at the code, the warning and your patch... it seems yours
> is the right solution...
> I'm compiling right now to test again and see the effects, Robert
> maybe you can test your failure case again? i'm really sure it's
> related to this...

I did some more testing over the weekend with this patched v17 patch. Since you've posted a v18 patch, let me write up some findings with the v17 patch before continuing with the v18 patch.

The tests were done on an x86_64 platform, 1 Gbit network interfaces, 3 servers. Non-default configuration changes are copy-pasted at the end of this mail.

1) no automatic switch to other synchronous standby

- start master server, add synchronous standby 1
- change allow_standalone_primary to off
- add second synchronous standby
- wait until pg_stat_replication shows both standbys are in STREAMING state
- stop standby 1

What happens is that the master stalls, where I expected that it would have switched to standby 2 to acknowledge commits.

The following thing was pilot error, but since I was test-piloting a new plane, I still think it might be useful feedback. In my opinion, any number and order of pg_ctl stops and starts on both the master and standby servers, as long as they are not with -m immediate, should never cause the state I reached.

2) reaching some sort of shutdown deadlock state

- start master server, add synchronous standby
- change allow_standalone_primary to off

then I did all sorts of test things, everything still ok.
Then I wanted to shut down everything, and maybe because of some symmetry (stack-like) I did the following, because I didn't think it through:

- pg_ctl stop on standby (didn't actually wait until done, but immediately in another terminal)
- pg_ctl stop on master

Oh wait.. master needs to sync transactions

- start standby again. but now: FATAL: the database system is shutting down

There is no clean way to get out of this situation. allow_standalone_primary in the face of shutdowns might be tricky. Maybe the master must be prohibited from entering the shutting down phase when allow_standalone_primary = off and there is no sync standby; that would allow the sync standby to attach again.

3) PANIC on standby server

At some point a standby suddenly disconnected after I started a new pgbench run on an existing master/standby pair, with the following error in the logfile.

LOCATION: libpqrcv_connect, libpqwalreceiver.c:171
PANIC: XX000: heap_update_redo: failed to add tuple
CONTEXT: xlog redo hot_update: rel 1663/16411/16424; tid 305453/15; new 305453/102
LOCATION: heap_xlog_update, heapam.c:4724
LOG: 00000: startup process (PID 32597) was terminated by signal 6: Aborted

This might be due to pilot error as well; I did several tests over the weekend, and after this error I was more alert about remembering immediate shutdowns and starting with a clean backup after them, and didn't see similar errors since.

4) The performance of the syncrep seems to be quite an improvement over the previous syncrep patches; I've seen tps of O(650) where the others were more like O(20). The O(650) tps is limited by the speed of the standby server I used: at several times the master would halt only because of heavy disk activity at the standby. A warning in the docs might be right: be sure to use good IO hardware for your synchronous replicas! With that bottleneck gone, I suspect the current syncrep version can go beyond 1000 tps over 1 Gbit.
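The rule floated in 2) can be written down as a small predicate. This is only a toy model in Python, not PostgreSQL code: the function name and its arguments are hypothetical, and it assumes the only inputs that matter are the allow_standalone_primary setting and the number of currently attached synchronous standbys.

```python
# Toy model of the rule suggested in 2). NOT PostgreSQL code; the function
# name and parameters are hypothetical and exist only for illustration.

def may_enter_shutdown(allow_standalone_primary: bool,
                       sync_standbys_connected: int) -> bool:
    """Return True if the master may enter the 'shutting down' phase.

    With allow_standalone_primary = off and no synchronous standby attached,
    the master would refuse to start shutting down, so that a standby can
    still connect and acknowledge the outstanding commits.
    """
    if allow_standalone_primary:
        # The master is allowed to run (and stop) without a sync standby.
        return True
    return sync_standbys_connected > 0


# The sequence from 2): the standby was stopped first, then the master.
print(may_enter_shutdown(allow_standalone_primary=False,
                         sync_standbys_connected=0))
```

Under this rule the pg_ctl stop on the master in the sequence above would be held off until the standby has been started again, instead of ending in "the database system is shutting down".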
regards,
Yeb Havinga

recovery.conf:
standby_mode = 'on'
primary_conninfo = 'host=mg73 user=repuser password=pwd application_name=standby1'
trigger_file = '/tmp/postgresql.trigger.5432'

postgresql.conf nondefault parameters:
log_error_verbosity = verbose
log_min_messages = warning
log_min_error_statement = warning
listen_addresses = '*' # what IP address(es) to listen on;
search_path='\"$user\", public, hl7'
archive_mode = on
archive_command = 'test ! -f /data/backup_in_progress || cp -i %p /archive/%f < /dev/null'
checkpoint_completion_target = 0.9
checkpoint_segments = 16
default_statistics_target = 500
constraint_exclusion = on
max_connections = 120
maintenance_work_mem = 128MB
effective_cache_size = 1GB
work_mem = 44MB
wal_buffers = 8MB
shared_buffers = 128MB
wal_level = 'archive'
max_wal_senders = 4
wal_keep_segments = 1000 # 16000MB (for production increase this)
synchronous_standby_names = 'standby1,standby2,standby3'
synchronous_replication = on
allow_standalone_primary = off
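For reference, the behaviour expected in 1) can be sketched as follows. This is a minimal Python model under stated assumptions, not the patch's actual logic: it assumes the acknowledging standby is simply the first name in synchronous_standby_names that is currently in STREAMING state, and the function name and the state dictionary are hypothetical.

```python
# Minimal model of the failover expected in 1). NOT the patch's code:
# pick_sync_standby and the 'states' mapping are hypothetical, and the
# "first listed name that is streaming wins" rule is an assumption.
from typing import Dict, Optional


def pick_sync_standby(synchronous_standby_names: str,
                      states: Dict[str, str]) -> Optional[str]:
    """Return the first listed standby in STREAMING state, or None."""
    for name in synchronous_standby_names.split(","):
        name = name.strip()
        if states.get(name) == "STREAMING":
            return name
    return None


names = "standby1,standby2,standby3"  # value from the postgresql.conf above

# Both standbys attached: standby1 acknowledges commits.
states = {"standby1": "STREAMING", "standby2": "STREAMING"}
print(pick_sync_standby(names, states))

# Standby 1 stopped: the expectation was a switch to standby2.
del states["standby1"]
print(pick_sync_standby(names, states))
```

When the function returns None and allow_standalone_primary is off, a stalled master would be the expected outcome; the stall reported in 1) occurred while standby 2 was still streaming.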