Thread: Is anyone aware of data loss causing MultiXact bugs in 9.3.2?

Is anyone aware of data loss causing MultiXact bugs in 9.3.2?

From

Peter Geoghegan

Date:

19 February 2014, 01:39:58

I've had multiple complaints of apparent data loss on 9.3.2 customer
databases. There are 2 total, both complaints from the past week, one
of which I was able to confirm. The customer's complaint is that
certain rows are either visible or invisible, depending on whether an
index scan is used or a sequential scan (I confirmed this with an
EXPLAIN ANALYZE). The expectation of the customer is that the rows
should always be visible, because he didn't delete them. These are
trivial queries on single trivial tables, but they have foreign keys.

It appears from our internal records that the database that I
confirmed the issue on has always been on 9.3.2 (the heap files have
only been touched by 9.3.2 binaries, although there is more than one
timeline). I cannot swear that this was never on an earlier point
release of 9.3, but it is considered quite unlikely. When I ran a
manual VACUUM FREEZE, I could no longer reproduce the issue (i.e. the
index scan and sequential scan both subsequently indicated that the
row of interest was not visible, and so are at least in agreement). To
me this suggests a problem with MultiXacts. However, I was under the
impression that the 9.3.3 fix "Rework tuple freezing protocol" fixed
an issue that could not manifest in such a way, and so it isn't
obvious to me that this is a known bug. A reading of the 9.3.3 release
notes offers no obvious clues as to what the problem might be. Apart
from the freezing rework, and the "Account for remote row locks
propagated by local updates" item, nothing jumps out, and it isn't
obvious to me how even the most pertinent two items from *.3 might
address relate to this. Have I missed something? Is this likely to be
worth debugging further, to determine if there is an undiscovered bug?
If so, I'm sure I can recover a copy of the data as it was before and
reproduce the problem.

I think that since this database was (very probably) always 9.3.2, we
would not have run the vacuum/vacuum_freeze_table_age amelioration
promoted for those upgrading to that point release (promoted in the
*.2 release notes). On this database, as of right now:
txid_current
──────────────     6242282
(1 row)


--
Peter Geoghegan

Re: Is anyone aware of data loss causing MultiXact bugs in 9.3.2?

From

Alvaro Herrera

Date:

19 February 2014, 01:50:44

Peter Geoghegan wrote:
> I've had multiple complaints of apparent data loss on 9.3.2 customer
> databases. There are 2 total, both complaints from the past week, one
> of which I was able to confirm. The customer's complaint is that
> certain rows are either visible or invisible, depending on whether an
> index scan is used or a sequential scan (I confirmed this with an
> EXPLAIN ANALYZE).

The multixact bugs would cause tuples to be hidden at the heap level.
If the tuples are visible in a seqscan, then these are more likely to be
related to index problems, not multixact problem.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Is anyone aware of data loss causing MultiXact bugs in 9.3.2?

From

Peter Geoghegan

Date:

19 February 2014, 01:56:56

On Tue, Feb 18, 2014 at 5:50 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> The multixact bugs would cause tuples to be hidden at the heap level.
> If the tuples are visible in a seqscan, then these are more likely to be
> related to index problems, not multixact problem.

That was my first suspicion, but then re-indexing didn't help, while
VACUUM FREEZE had the immediate effect of making both plans give a
consistent answer.

-- 
Peter Geoghegan

Re: Is anyone aware of data loss causing MultiXact bugs in 9.3.2?

From

Peter Geoghegan

Date:

19 February 2014, 02:03:12

On Tue, Feb 18, 2014 at 5:56 PM, Peter Geoghegan <pg@heroku.com> wrote:
> That was my first suspicion, but then re-indexing didn't help, while
> VACUUM FREEZE had the immediate effect of making both plans give a
> consistent answer.

I should add that this is an unremarkable int4 primary key (which was
dropped and recreated by the customer). There is no obvious reason why
we should see wrong answers that are attributable to application code
or environment.

-- 
Peter Geoghegan

Re: Is anyone aware of data loss causing MultiXact bugs in 9.3.2?

From

Peter Geoghegan

Date:

19 February 2014, 02:10:08

On Tue, Feb 18, 2014 at 5:50 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Peter Geoghegan wrote:
>> I've had multiple complaints of apparent data loss on 9.3.2 customer
>> databases. There are 2 total, both complaints from the past week, one
>> of which I was able to confirm. The customer's complaint is that
>> certain rows are either visible or invisible, depending on whether an
>> index scan is used or a sequential scan (I confirmed this with an
>> EXPLAIN ANALYZE).
>
> The multixact bugs would cause tuples to be hidden at the heap level.
> If the tuples are visible in a seqscan, then these are more likely to be
> related to index problems, not multixact problem.

I guess I wasn't clear enough here: The row in question was visible
with a sequential scans but *not* with an index scan. So you have it
backwards here (understandably).


-- 
Peter Geoghegan

Re: Is anyone aware of data loss causing MultiXact bugs in 9.3.2?

From

Andres Freund

Date:

19 February 2014, 08:40:20

On 2014-02-18 18:10:02 -0800, Peter Geoghegan wrote:
> On Tue, Feb 18, 2014 at 5:50 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > Peter Geoghegan wrote:
> >> I've had multiple complaints of apparent data loss on 9.3.2 customer
> >> databases. There are 2 total, both complaints from the past week, one
> >> of which I was able to confirm. The customer's complaint is that
> >> certain rows are either visible or invisible, depending on whether an
> >> index scan is used or a sequential scan (I confirmed this with an
> >> EXPLAIN ANALYZE).
> >
> > The multixact bugs would cause tuples to be hidden at the heap level.
> > If the tuples are visible in a seqscan, then these are more likely to be
> > related to index problems, not multixact problem.
> 
> I guess I wasn't clear enough here: The row in question was visible
> with a sequential scans but *not* with an index scan. So you have it
> backwards here (understandably).

Was there an index only scan or just a index scan? Any chance of a
corrupted index?

Do you still have the page from before you did the VACUUM?

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Is anyone aware of data loss causing MultiXact bugs in 9.3.2?

From

Peter Geoghegan

Date:

19 February 2014, 08:55:09

On Wed, Feb 19, 2014 at 12:40 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Was there an index only scan or just a index scan? Any chance of a
> corrupted index?

Just an index scan. I think it's unlikely to be a corrupt index,
because the customer said that he dropped and restored the index, and
it's difficult to imagine how VACUUM FREEZE could then have impacted a
plan with only a sequential scan.

> Do you still have the page from before you did the VACUUM?

It seems likely that I could get it, given that the problem persisted
over several days (apparently there was an earlier, manual attempt to
repair the damage by hand). I'm reasonably confident that I could get
back an affected database by performing a point-in-time recovery. That
would also give me the benefit of an environment that I could break as
needed.

-- 
Peter Geoghegan

Re: Is anyone aware of data loss causing MultiXact bugs in 9.3.2?

From

Andres Freund

Date:

19 February 2014, 09:06:52

On 2014-02-19 00:55:03 -0800, Peter Geoghegan wrote:
> On Wed, Feb 19, 2014 at 12:40 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > Was there an index only scan or just a index scan? Any chance of a
> > corrupted index?
> 
> Just an index scan. I think it's unlikely to be a corrupt index,
> because the customer said that he dropped and restored the index, and
> it's difficult to imagine how VACUUM FREEZE could then have impacted a
> plan with only a sequential scan.

I am wondering whether it's possibly either the vm or the all-visible
flag on the page...

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services