Re: Why we lost Uber as a user - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: Why we lost Uber as a user |
Date | |
Msg-id | CA+Tgmoaee_23Ms5hasOT0tzBX12mcA5Ont-8hQ2Cvr4ggbzerg@mail.gmail.com Whole thread Raw |
In response to | Re: Why we lost Uber as a user (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: Why we lost Uber as a user
|
List | pgsql-hackers |
On Wed, Aug 3, 2016 at 2:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> I don't think they are saying that logical replication is more >> reliable than physical replication, nor do I believe that to be true. >> I think they are saying that if logical corruption happens, you can >> fix it by typing SQL statements to UPDATE, INSERT, or DELETE the >> affected rows, whereas if physical corruption happens, there's no >> equally clear path to recovery. > > Well, that's not an entirely unreasonable point, but I dispute the > implication that it makes recovery from corruption an easy thing to do. > How are you going to know what SQL statements to issue? If the master > database is changing 24x7, how are you going to keep up with that? I think in many cases people fix their data using business logic. For example, suppose your database goes down and you have to run pg_resetxlog to get it back up. You dump-and-restore, as one does, and find that you can't rebuild one of your unique indexes because there are now two records with that same PK. Well, what you do is you look at them and judge which one has the correct data, often the one that looks more complete or the one with the newer timestamp. Or, maybe you need to merge them somehow. In my experience helping users through problems of this type, once you explain the problem to the user and tell them they have to square it on their end, the support call ends. The user may not always be entirely thrilled about having to, say, validate a problematic record against external sources of truth, but they usually know how to do it. Database bugs aren't the only way that databases become inaccurate. If the database that they use to keep track of land ownership in the jurisdiction where I live says that two different people own the same piece of property, somewhere there is a paper deed in a filing cabinet. Fishing that out to understand what happened may not be fun, but a DBA can explain that problem to other people in the organization and those people can get it fixed. It's a problem, but it's fixable. On the other hand, if a heap tuple contains invalid infomask bits that cause an error every time you read the page (this actually happened to an EnterpriseDB customer!), the DBA can't tell other people how to fix it and can't fix it personally either. Instead, the DBA calls me. While I try to figure out what happened and solve the problem, every sequential scan on that table fails, so the customer is basically down. In contrast, in the logical corruption scenario, one record might be wrong, but basically everything is still working. So it's a difference between a problem that the DBA can work with coworkers to fix while the system is up, and a problem that the DBA can't fix and the system is meanwhile down. That's a big difference. > I think the realistic answer if you suffer replication-induced corruption > is usually going to be "re-clone that slave", and logical rep doesn't > really offer much gain in that. If you're using multi-master replication, the notion of what's a slave gets a bit fuzzy, but, apart from that, yes, this is often the solution. However, even here, logical replication can be better. Given the right tools, I can fix up the slave incrementally, comparing it to the master row by row and updating anything that's wrong. If I have to rebuild a physical master, I'm offline. The difference doesn't matter if the slave is so badly corrupted that it's unusable, but it's very common for corruption to involve only a handful of records, and many users not unreasonably prefer a database with a couple of corrupted records to one which is totally down. "Hey, the payroll record for that Tom Lane guy is messed up, don't cut his paycheck until we get that straightened out." "OK, no problem." -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
pgsql-hackers by date: