From:           Andres Freund
Subject:        Re: Replication identifiers, take 4
Msg-id:         20150216002155.GI15326@awork2.anarazel.de
In response to: Replication identifiers, take 3 (Andres Freund <andres@2ndquadrant.com>)
List:           pgsql-hackers
Hi,

Here's my next attempt at producing something we can agree upon.

The major change that might achieve that is that I've now provided a
separate method to store the origin_id of a node. I've made it
conditional on !REPLICATION_IDENTIFIER_REUSE_PADDING, to show both
paths. That new method uses Heikki's xlog rework to dynamically add the
origin to the record if an origin is set up. That works surprisingly
simply.

Other changes:

* Locking to prevent several backends from replaying changes at the
  same time. This is actually overly restrictive in some cases, but I
  think it's good enough for now.

* Logical decoding grew a filter_by_origin callback that allows
  ignoring changes that were replayed on a remote system. Such filters
  are executed before much is done with records, potentially saving a
  fair bit of cost.

* Rebased. That took a bit due to the xlog and other changes.

* A couple more SQL interface functions (like dropping a replication
  identifier).

I also want to quickly recap replication identifiers, given that
in-person conversations with several people proved that the concept was
slightly misunderstood:

Think about a logical replication solution trying to replay changes.
The server the data is replayed into crashes every now and then.
Replication identifiers allow you to do something like:

do_replication()
{
    source = ConnectToSourceSystem('mysource');
    target = ConnectToTargetSystem('target');

    # mark that we're replaying
    target.exec($$SELECT pg_replication_identifier_setup_replaying_from('myrep_mysource')$$);

    # get how far we replayed last time round
    remote_lsn = target.exec($$SELECT remote_lsn FROM pg_replication_identifier_progress
                                WHERE external_id = 'myrep_mysource'$$);

    # and now replay changes
    copystream = source.exec('START_LOGICAL_REPLICATION SLOT ... START %x', remote_lsn);

    while (record = copystream.get_record())
    {
        if (record.type == 'begin')
        {
            target.exec('BEGIN');
            # set up the origin position of this individual xact
            target.exec('SELECT pg_replication_identifier_setup_tx_origin($1, $2);',
                        record.origin_lsn, record.origin_commit_timestamp);
        }
        else if (record.type == 'change')
            target.exec(record.change_sql);
        else if (record.type == 'commit')
            target.exec('COMMIT');
    }
}

A non-pseudocode version of the above would be safe against crashes of
both the source and the target system. If the target system crashes,
the replication identifier logic recovers how far we had replayed as
part of crash recovery. If the source system crashes/disconnects, we
still have the current value in memory. Note that this works perfectly
well even if the target system (and, for that matter, the source
system) uses synchronous_commit = off - we'll not miss any changes.

Furthermore, the fact that the origin of records is recorded makes it
possible to avoid decoding them in logical decoding. That has both
efficiency advantages (we can filter them out before they are stored in
memory/on disk) and functionality advantages: imagine using a logical
replication solution to replicate inserts to a single table between two
databases where inserts are allowed on both - unless you prevent the
replicated inserts from being replicated again, you obviously have a
loop. This infrastructure lets you avoid that.

The SQL interface consists of:

# manage existence of identifiers
internal_id pg_replication_identifier_create(external_id);
void pg_replication_identifier_drop(external_id);

# replay management
void pg_replication_identifier_setup_replaying_from(external_id);
void pg_replication_identifier_reset_replaying_from();
bool pg_replication_identifier_is_replaying();
void pg_replication_identifier_setup_tx_origin(remote_lsn, remote_commit_time);

# replication progress status view
SELECT * FROM pg_replication_identifier_progress;

# replication identifiers
SELECT * FROM pg_replication_identifier;
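To make the interface concrete, here's a minimal sketch of driving
these functions from plain SQL on the target, in the same shape as the
pseudocode above; the identifier name and the LSN/timestamp values are
made up for the example:

-- once: create an identifier for the upstream node (returns the
-- internal id)
SELECT pg_replication_identifier_create('myrep_mysource');

-- per session: mark that we're replaying from that node
SELECT pg_replication_identifier_setup_replaying_from('myrep_mysource');

-- per replayed transaction: record the origin position, then apply
BEGIN;
SELECT pg_replication_identifier_setup_tx_origin('0/12D9C70',
       '2015-02-16 00:21:55+00');
-- ... apply the decoded changes here ...
COMMIT;

-- check how far replay got, e.g. after a crash
SELECT * FROM pg_replication_identifier_progress;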
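For contrast, without this infrastructure a replication solution has to
track its progress in a regular table that it updates in every replayed
transaction, roughly like this (the tracking table is hypothetical):

BEGIN;
-- ... apply the decoded changes here ...
UPDATE my_replication_progress    -- hypothetical tracking table
   SET remote_lsn = '0/12D9C70'
 WHERE external_id = 'myrep_mysource';
COMMIT;

Every replayed transaction leaves behind a dead version of that
progress row, and long-running snapshots on the target keep those
versions from being cleaned up.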
Petr has developed (for UDR, i.e. logical replication on top of 9.4) an
SQL reimplementation of replication identifiers, and that has proven
that, for busier workloads, doing a table update to store the
replication progress indeed has a noticeable overhead - especially if
there's some longer-running activity on the standby.

The bigger questions I have are:

1) Where to store the origin. I personally still think that using the
   padding is fine. Now that I have proven that it's pretty simple to
   store additional information, the argument that the padding might be
   needed for something else doesn't really hold anymore. But I can
   live with the other solution as well - 3 bytes of additional
   overhead ain't so bad.

2) If we go with the !REPLICATION_IDENTIFIER_REUSE_PADDING solution, do
   we want to store the origin only on relevant records? That'd be
   XLOG_HEAP_INSERT/XLOG_HEAP2_MULTI_INSERT/XLOG_HEAP_UPDATE //
   XLOG_XACT_COMMIT/XLOG_XACT_COMMIT_PREPARED. I'm thinking of
   something like XLogLogOriginIfAvailable() before emitting the
   relevant XLogInsert()s.

3) There should be a lwlock for the individual replication identifier
   progress slots.

4) Right now identifier progress is stored during checkpoints in
   special files - maybe it'd be better to store it inside the
   checkpoint record somehow. We read that even after a clean shutdown,
   so that should be fine.

5) I think there are issues with a streaming replication standby if
   many identifiers are created/dropped. Those shouldn't be too hard to
   fix.

6) Obviously the hack in bootstrap.c to get riname marked NOT NULL
   isn't acceptable. Either I need to implement bootstrap support for
   marking varlenas NOT NULL, as discussed nearby, or replace the
   syscache lookup with an index lookup.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services