Re: Support for N synchronous standby servers - take 2 - Mailing list pgsql-hackers
From | Kyotaro HORIGUCHI |
---|---|
Subject | Re: Support for N synchronous standby servers - take 2 |
Date | |
Msg-id | 20151214.182738.130827803.horiguchi.kyotaro@lab.ntt.co.jp Whole thread Raw |
In response to | Re: Support for N synchronous standby servers - take 2 (Masahiko Sawada <sawada.mshk@gmail.com>) |
Responses |
Re: Support for N synchronous standby servers - take 2
|
List | pgsql-hackers |
Thank you for the new patch. At Wed, 9 Dec 2015 20:59:20 +0530, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoDcn1fToCcYRqpU6fMY1xnpDdAKDTcbhW1R9M1mPM0kZg@mail.gmail.com> > On Wed, Nov 18, 2015 at 2:06 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > I agree with #3 way and the s_s_name format you suggested. > > I think that It's extensible and is tolerable for future changes. > > I'm going to implement the patch based on this idea if other hackers > > agree with this design. > > Please find the attached draft patch which supports multi sync replication. > This patch adds a GUC parameter synchronous_replication_method, which > represent the method of synchronous replication. > > [Design of replication method] > synchronous_replication_method has two values; 'priority' and > '1-priority' for now. > We can expand the kind of its value (e.g, 'quorum', 'json' etc) in the future. > > * s_r_method = '1-priority' > This method is for backward compatibility, so the syntax of s_s_names > is same as today. > The behavior is same as well. > > * s_r_method = 'priority' > This method is for multiple synchronous replication using priority method. > The syntax of s_s_names is, > <number of sync standbys>, <standby name> [, ...] Is there anyone opposed to this? > For example, s_r_method = 'priority' and s_s_names = '2, node1, node2, > node3' means that the master waits for acknowledge from at least 2 > lowest priority servers. > If 4 standbys(node1 - node4) are available, the master server waits > acknowledge from 'node1' and 'node2. > The each status of wal senders are; > > =# select application_name, sync_state from pg_stat_replication order > by application_name; > application_name | sync_state > ------------------+------------ > node1 | sync > node2 | sync > node3 | potential > node4 | async > (4 rows) > > After 'node2' crashed, the master will wait for acknowledge from > 'node1' and 'node3'. > The each status of wal senders are; > > =# select application_name, sync_state from pg_stat_replication order > by application_name; > application_name | sync_state > ------------------+------------ > node1 | sync > node3 | sync > node4 | async > (3 rows) > > [Changing replication method] > When we want to change the replication method, we have to change the > s_r_method at first, and then do pg_reload_conf(). > After changing replication method, we can change the s_s_names. Mmm. I should be able to be changed at once, because s_r_method and s_s_names contradict each other during the intermediate state. > [Expanding replication method] > If we want to expand new replication method additionally, we need to > implement two functions for each replication method: > * int SyncRepGetSynchronousStandbysXXX(int *sync_standbys) > This function obtains the list of standbys considered as synchronous > at that time, and return its length. > * bool SyncRepGetSyncLsnXXX(XLogRecPtr *write_pos, XLogRecPtr *flush_pos) > This function obtains LSNs(write, flush) considered as synced. > > Also, this patch debug code is remain yet, you can debug this behavior > using by enable DEBUG_REPLICATION macro. > > Please give me feedbacks. I haven't looked into this fully (sorry) but I'm concerned about several points. - I feel that some function names looks too long. For example SyncRepGetSynchronousStandbysOnePriority occupies more thanthe half of a line. (However, the replication code alrady has many long function names..) - The comment below of SyncRepGetSynchronousStandbyOnePriority, > /* Find lowest priority standby */ The code where the comment is for is doing the correct thing. Howerver, the comment is confusing. A lower priority *value*means a higher priority. - SyncRepGetSynchronousStandbys checks all if()s even when the first one matches. Use switch or "else if" there if you theyare exclusive each other. - Do you intende the DEBUG_REPLICATION code in SyncRepGetSynchronousStandbys*() to be the final shape? The same code blockswhich can work for both method should be in their common caller but SyncRepGetSyncLsns*() are headache. Although itmight need more refactoring, I'm sorry but I don't see a desirable shape for now. By the way, palloc(20)/free() in such short term looks ineffective. - SyncRepGetSyncLsnsPriority For the comment "/* Find lowest XLogRecPtr of both write and flush from sync_nodes */", LSN is compared as early or lateso the comment would be better to be something like "Keep/Collect the earliest write and flush LSNs among prioritizedstandbys". And what is more important, this block handles write and flush LSN jumbled and it reults in missing the earliest(= mostdelayed) LSN for certain cases. The following is an example. Standby 1: write LSN = 10, flush LSN = 5 Standby 2: write LSN = 8 , flush LSN = 6 For this case, finally we get tmp_write = 10 and tmp_flush = 5 from the current code, where tmp_write has wrong value sinceLSN = 10 has *not* been written yet on standby 2. (the names "tmp_*" don't seem appropriate here) regards, -- Kyotaro Horiguchi NTT Open Source Software Center
pgsql-hackers by date: