Thread: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      16792
Logged by:          pawel kudzia
Email address:      kudzia@gmail.com
PostgreSQL version: 11.10
Operating system:   Debian Buster x86_64
Description:

I'd like to ask for advice on how to proceed with pinpointing the cause of the silent corruption of a GIN index that I'm facing.

Relevant part of the table structure:

CREATE TABLE public.entity (
    entity_id bigint NOT NULL,
    attribute_value_ids integer[] NOT NULL,
    attribute_name_ids integer[] NOT NULL,
    [..]
);

ALTER TABLE ONLY public.entity ADD CONSTRAINT entity_pkey PRIMARY KEY (entity_id);
CREATE INDEX entity_attribute_name_ids_gin ON public.entity USING gin (attribute_name_ids public.gin__int_ops);
CREATE INDEX entity_attribute_value_ids_gin ON public.entity USING gin (attribute_value_ids public.gin__int_ops);

How does the issue manifest?

Queries that use the GIN index on an integer[] column occasionally return too many rows, including ones that do not actually match the criteria expressed in WHERE. Queries like the one below should never return any rows, yet occasionally they do return some results:

data=> SELECT entity_id FROM entity WHERE ( attribute_name_ids && '{4980}' ) AND NOT ( (attribute_name_ids||0) && '{4980}') ;
 entity_id
-----------
  31213924
  31195117
  31209184
(3 rows)

Query plan:

                                            QUERY PLAN
-------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on entity  (cost=50.39..3068.11 rows=26923 width=22)
   Recheck Cond: (attribute_name_ids && '{4980}'::integer[])
   Filter: (NOT ((attribute_name_ids || 0) && '{4980}'::integer[]))
   ->  Bitmap Index Scan on entity_attribute_name_ids_gin  (cost=0.00..50.39 rows=27058 width=0)
         Index Cond: (attribute_name_ids && '{4980}'::integer[])
(5 rows)

The query was crafted intentionally to let PostgreSQL use the GIN index entity_attribute_name_ids_gin [ left side of the AND ] and then filter out rows whose values, on inspection, do not actually match that criterion.

What have I done so far?

Originally the problem was discovered while running PostgreSQL 11.7 from Debian's repo. In the first step we upgraded to PostgreSQL 11.9, also from Debian's repo, and ran:

reindex (verbose) table entity;
vacuum (verbose, full, analyze) entity;

After a few days of updates the problem returned. We upgraded to PostgreSQL 11.10 from the postgresql.org repository for Debian and reindexed / vacuumed again. After a few weeks of updates the problem returned again.
Other information worth noting:

* the table in question has 38M rows and is the only table created after PostgreSQL installation
* the server in question is the master of streaming replication; the issue also occurs on the slave servers - this suggests corruption of the on-disk GIN index data
* just rewriting the rows falsely returned by the SELECT above - UPDATE entity SET attribute_name_ids = '{....}' WHERE entity_id = 123 - fixes the issue
* we've set up rudimentary consistency checks that catch some of the corruptions within 1h of occurring - so far corruptions only happen to rows that have been updated since the previous check (a sketch of such a probe follows this list)
* the server in question handles heavy read/write traffic
* the PostgreSQL instance in question runs in an LXC container with Debian Buster, on top of Debian Buster running on a bare-metal server
* it's highly unlikely that a hardware malfunction is to blame - the same physical server also handles a few MySQL instances with hundreds of GB of data, with heavy consistency checks cross-checking the content of MySQL against PostgreSQL, unpacking gzip'ed blobs stored in the MySQL databases and de-serializing the objects stored there. If there were a bit-rot / bit-flip in memory that ECC did not detect or fix, those high-level consistency checks would pick it up.
* despite numerous attempts I cannot reproduce the issue in a test environment, possibly due to the much lower level of write traffic. It takes days to weeks for it to re-occur on the production server.

Thank you in advance for your suggestions on how to tackle this.

best regards,
Pawel
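For reference, a sketch of the kind of self-contradictory probe used above, extended to check several attribute ids in one pass (the id list is illustrative, taken from ids mentioned later in this thread; any row returned is by construction a false positive coming from the index):

SET enable_seqscan = off;   -- push the planner towards the GIN index, as done later in the thread

SELECT n.attr_id, e.entity_id, e.attribute_name_ids
FROM unnest('{471,4980,7572}'::int[]) AS n(attr_id)
JOIN entity e
  ON (e.attribute_name_ids && ARRAY[n.attr_id])
 AND NOT ((e.attribute_name_ids || 0) && ARRAY[n.attr_id]);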
Re: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Pawel Kudzia
Date:
On Sun, Dec 27, 2020 at 6:19 PM PG Bug reporting form <noreply@postgresql.org> wrote:
> [full quote of the original report trimmed]
I'd like to provide more details from another occurrence of this issue:

psql (11.7 (Debian 11.7-0+deb10u1), server 11.9 (Debian 11.9-0+deb10u1))

Example where the GIN index is returning misleading information:

data=# SELECT entity_id,attribute_name_ids FROM entity WHERE ( attribute_name_ids && '{471}' ) AND NOT ( (attribute_name_ids||0) && '{471}') ;
 entity_id |                              attribute_name_ids
-----------+------------------------------------------------------------------------------
 197911914 | {2,3,4,6,7,8,9,10,13,17,35,48,66,71,3270,3273,3279,4377,4842,5379,5787,7416}
(1 row)

data=# EXPLAIN ANALYZE SELECT entity_id,attribute_name_ids FROM entity WHERE ( attribute_name_ids && '{471}' ) AND NOT ( (attribute_name_ids||0) && '{471}') ;
                                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on entity  (cost=71.04..3432.73 rows=29881 width=126) (actual time=64.179..65.255 rows=1 loops=1)
   Recheck Cond: (attribute_name_ids && '{471}'::integer[])
   Filter: (NOT ((attribute_name_ids || 0) && '{471}'::integer[]))
   Rows Removed by Filter: 8814
   Heap Blocks: exact=7877
   ->  Bitmap Index Scan on entity_attribute_name_ids_gin  (cost=0.00..71.04 rows=30031 width=0) (actual time=10.396..10.398 rows=13628 loops=1)
         Index Cond: (attribute_name_ids && '{471}'::integer[])
 Planning Time: 0.164 ms
 Execution Time: 65.458 ms
(9 rows)

When the GIN index is not used, the results are as expected:

data=# EXPLAIN ANALYZE SELECT entity_id,attribute_name_ids FROM entity WHERE entity_id=197911914 AND attribute_name_ids && '{471}';
                                                       QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Index Scan using entity_pkey on entity  (cost=0.00..0.20 rows=1 width=126) (actual time=0.090..0.091 rows=0 loops=1)
   Index Cond: (entity_id = 197911914)
   Filter: (attribute_name_ids && '{471}'::integer[])
   Rows Removed by Filter: 1
 Planning Time: 0.233 ms
 Execution Time: 0.116 ms
(6 rows)

data=# SELECT entity_id,attribute_name_ids FROM entity WHERE entity_id=197911914 AND attribute_name_ids && '{471}';
 entity_id | attribute_name_ids
-----------+--------------------
(0 rows)

From my observations, the introduction of a misleading entry in the GIN index coincides with an update of the database row that will later be incorrectly reported.

Fixing the issue can be done by:

UPDATE entity SET attribute_name_ids = '{}' WHERE entity_id = 197911914;
UPDATE entity SET attribute_name_ids = '{2,3,4,6,7,8,9,10,13,17,35,48,66,71,3270,3273,3279,4377,4842,5379,5787,7416}' WHERE entity_id = 197911914;

The table has the following structure:

        Column         |              Type              | Collation | Nullable |    Default
-----------------------+--------------------------------+-----------+----------+---------------
 entity_id             | bigint                         |           | not null |
 attribute_value_ids   | integer[]                      |           | not null |
 attribute_name_ids    | integer[]                      |           | not null |
 [..]
Indexes:
    "entity_pkey" PRIMARY KEY, btree (entity_id)
    "entity_attribute_name_ids_gin" gin (attribute_name_ids gin__int_ops)
    "entity_attribute_value_ids_gin" gin (attribute_value_ids gin__int_ops)
[..]

Its disk footprint is ~60GB, ~38M rows; in fact it's the only table in this database. The table gets tens of millions of updates each day and maybe 10-100k inserts, with very few deletes.

I have taken a copy of the database files from a streaming replication slave server. I also have WAL files from around the time that this particular corruption occurred.

Is there any additional information I can provide?

thank you!

--
regards,
Pawel Kudzia
Re: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Pawel Kudzia
Date:
Based on the suggestions we've received on IRC we have:

* moved some of our PostgreSQL instances to another physical server,
* enabled data_checksums on all instances,
* upgraded our servers to 13.3 (Debian 13.3-1.pgdg100+1), hoping that https://github.com/postgres/postgres/commit/0d779d22a290a89b6c892137a37280b9588ad0cc addresses the issue [ although it's tsvector-specific, so unlikely to be related to our issue ].

Yet soon after, we observed another inconsistency.

Recap of the problem - SELECTs return rows, based on the GIN index, that do not actually meet the criteria expressed in WHERE:

SELECT entity_id,attribute_name_ids FROM entity WHERE ( attribute_name_ids && '{7572}' ) AND NOT ( (attribute_name_ids||0) && '{7572}') LIMIT 100 ;
 entity_id |                                    attribute_name_ids
-----------+----------------------------------------------------------------------------------------
  22327791 | {1,2,3,4,6,8,9,10,11,13,14,17,19,21,35,72,366,1659,2208,2211,3270,3273,3279,5787,7650}
(1 row)

EXPLAIN ANALYZE of the query above:

                                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=35.46..46.67 rows=100 width=132) (actual time=307.221..335.842 rows=1 loops=1)
   ->  Bitmap Heap Scan on entity  (cost=35.46..3705.32 rows=32724 width=132) (actual time=307.220..335.839 rows=1 loops=1)
         Recheck Cond: (attribute_name_ids && '{7572}'::integer[])
         Rows Removed by Index Recheck: 72012
         Filter: (NOT ((attribute_name_ids || 0) && '{7572}'::integer[]))
         Rows Removed by Filter: 21501
         Heap Blocks: exact=8998 lossy=9257
         ->  Bitmap Index Scan on entity_attribute_name_ids_gin  (cost=0.00..35.46 rows=32889 width=0) (actual time=19.790..19.790 rows=115485 loops=1)
               Index Cond: (attribute_name_ids && '{7572}'::integer[])
 Planning Time: 0.068 ms
 Execution Time: 335.879 ms
(11 rows)

Relevant part of the table structure:

CREATE TABLE public.entity (
    entity_id bigint NOT NULL,
    attribute_value_ids integer[] NOT NULL,
    attribute_name_ids integer[] NOT NULL,
    [..]
);

ALTER TABLE ONLY public.entity ADD CONSTRAINT entity_pkey PRIMARY KEY (entity_id);
CREATE INDEX entity_attribute_name_ids_gin ON public.entity USING gin (attribute_name_ids public.gin__int_ops);
CREATE INDEX entity_attribute_value_ids_gin ON public.entity USING gin (attribute_value_ids public.gin__int_ops);

There are at least tens of updates / inserts per second to the table all the time. The issue above manifests very rarely, after a few days if not weeks of uptime. We did not find any deterministic way of reproducing it, but it's only a matter of time until it shows up again.

The issue persists and is related to on-disk data: it replicates from the streaming replication masters to the slaves.

Thank you in advance for your suggestions on how to tackle this.

--
regards,
Pawel Kudzia
Re: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Pawel Kudzia
Date:
with help from IRC we've found that decreasing work_mem from 1MB to 256kB or less makes the problem go away:

data=# show work_mem;
 work_mem
----------
 256kB
(1 row)

data=# SELECT entity_id,attribute_name_ids FROM entity WHERE ( attribute_name_ids && '{7572}' ) AND NOT ( (attribute_name_ids||0) && '{7572}') LIMIT 100 ;
 entity_id | attribute_name_ids
-----------+--------------------
(0 rows)

data=# explain analyze SELECT entity_id,attribute_name_ids FROM entity WHERE ( attribute_name_ids && '{7572}' ) AND NOT ( (attribute_name_ids||0) && '{7572}') LIMIT 100 ;
                                                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=26.53..38.04 rows=100 width=132) (actual time=110.013..110.015 rows=0 loops=1)
   ->  Bitmap Heap Scan on entity  (cost=26.53..3780.78 rows=32606 width=132) (actual time=110.011..110.011 rows=0 loops=1)
         Recheck Cond: (attribute_name_ids && '{7572}'::integer[])
         Rows Removed by Index Recheck: 102983
         Filter: (NOT ((attribute_name_ids || 0) && '{7572}'::integer[]))
         Rows Removed by Filter: 21501
         Heap Blocks: exact=898 lossy=13752
         ->  Bitmap Index Scan on entity_attribute_name_ids_gin  (cost=0.00..26.53 rows=32770 width=0) (actual time=3.582..3.583 rows=21518 loops=1)
               Index Cond: (attribute_name_ids && '{7572}'::integer[])
 Planning Time: 0.173 ms
 Execution Time: 110.220 ms
(11 rows)

The problem manifests again with work_mem increased to 512kB or higher:

data=# show work_mem;
 work_mem
----------
 512kB
(1 row)

data=# SELECT entity_id,attribute_name_ids FROM entity WHERE ( attribute_name_ids && '{7572}' ) AND NOT ( (attribute_name_ids||0) && '{7572}') LIMIT 100 ;
 entity_id |                                    attribute_name_ids
-----------+----------------------------------------------------------------------------------------
  22327791 | {1,2,3,4,6,8,9,10,11,13,14,17,19,21,35,72,366,1659,2208,2211,3270,3273,3279,5787,7650}
(1 row)

data=# explain analyze SELECT entity_id,attribute_name_ids FROM entity WHERE ( attribute_name_ids && '{7572}' ) AND NOT ( (attribute_name_ids||0) && '{7572}') LIMIT 100 ;
                                                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=26.73..38.14 rows=100 width=132) (actual time=112.268..119.475 rows=1 loops=1)
   ->  Bitmap Heap Scan on entity  (cost=26.73..3748.28 rows=32606 width=132) (actual time=112.267..119.473 rows=1 loops=1)
         Recheck Cond: (attribute_name_ids && '{7572}'::integer[])
         Rows Removed by Index Recheck: 68905
         Filter: (NOT ((attribute_name_ids || 0) && '{7572}'::integer[]))
         Rows Removed by Filter: 21501
         Heap Blocks: exact=5630 lossy=9012
         ->  Bitmap Index Scan on entity_attribute_name_ids_gin  (cost=0.00..26.73 rows=32770 width=0) (actual time=3.924..3.924 rows=21518 loops=1)
               Index Cond: (attribute_name_ids && '{7572}'::integer[])
 Planning Time: 0.113 ms
 Execution Time: 119.801 ms
(11 rows)

"vacuum table entity;" did not help, neither did "select gin_clean_pending_list('entity_attribute_name_ids_gin');".
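For quick re-testing, the toggle observed above boils down to re-running the same query under the two settings (a sketch of the session; results as reported above, specific to this data set):

SET work_mem = '256kB';
SELECT entity_id, attribute_name_ids FROM entity
WHERE ( attribute_name_ids && '{7572}' )
  AND NOT ( (attribute_name_ids||0) && '{7572}' )
LIMIT 100;                       -- returns 0 rows at this setting

SET work_mem = '512kB';
-- repeating the same SELECT now returns the non-matching row 22327791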
Re: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Tom Lane
Date:
Pawel Kudzia <kudzia@gmail.com> writes:
> with help from IRC we've found that decreasing work_mem from 1MB to 256kB
> or less makes the problem go away:

Hmm.  So that suggests that the index itself is *not* corrupt, but the problem is associated with a bug in the indexscan algorithms.

Have you experimented with different index opclasses?  Your original report said you were using gin__int_ops, but that's not the default, and we sort of deprecate intarray these days.

			regards, tom lane
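For anyone wanting to try that comparison, a minimal sketch of building a side-by-side index with the built-in opclass (the new index name here is made up; the thread later notes that production eventually moved to the built-in opclass):

-- existing index, built with the intarray opclass:
--   CREATE INDEX entity_attribute_name_ids_gin
--       ON public.entity USING gin (attribute_name_ids public.gin__int_ops);

-- comparison index using the default built-in opclass (array_ops):
CREATE INDEX CONCURRENTLY entity_attribute_name_ids_gin_core
    ON public.entity USING gin (attribute_name_ids);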
Re: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Tom Lane
Date:
I wrote:
> Pawel Kudzia <kudzia@gmail.com> writes:
>> with help from IRC we've found that decreasing work_mem from 1MB to 256kB
>> or less makes the problem go away:

> Hmm.  So that suggests that the index itself is *not* corrupt,
> but the problem is associated with a bug in the indexscan
> algorithms.

After staring at the code I think there is at least one bug, and possibly two, in shimTriConsistentFn.  That's likely to be implicated here because intarray's GIN opclass only provides a bool consistent function.  I'm not very clear on the circumstances that lead to the scan code inventing GIN_MAYBE inputs, so I haven't been able to construct a test case.

The definite bug is triggered because intarray relies on the API specification that says that if the search mode is GIN_SEARCH_MODE_DEFAULT, then the consistentFn will only be called when there's at least one TRUE key:

        case RTOverlapStrategyNumber:
            /* result is not lossy */
            *recheck = false;
            /* at least one element in check[] is true, so result = true */
            res = true;
            break;

shimTriConsistentFn ignores this and calls it on all-FALSE inputs, for which it'll incorrectly get a TRUE result, as it will also for all the following checks.  The upshot is that shimTriConsistentFn will convert any case with a MAYBE input to a hard TRUE result, with no recheck requirement.  This'd easily explain the reported misbehavior.  So in the spot where we do this:

    /* First call consistent function with all the maybe-inputs set FALSE */
    for (i = 0; i < nmaybe; i++)
        key->entryRes[maybeEntries[i]] = GIN_FALSE;
    curResult = directBoolConsistentFn(key);

we need to add some code that checks for default searchMode, and in that case doesn't call the consistentFn unless at least one (non-MAYBE) input is TRUE.

The other thing that I'm unsure about, because the code is horribly underdocumented, is that it's not very clear whether shimTriConsistentFn is accurately converting between the bool and the tristate APIs.  That's because it's not very clear what the tristate API actually *is*.  In particular, is the end state of key->recheckCurItem meaningful in the tristate case?  If it's not, then the short-circuit case for no MAYBE inputs is broken, because it'll return TRUE when the bool consistentFn is trying to tell us to recheck.  But if it is meaningful, then the multiway case is broken, because it will return the recheckCurItem result set by the last call to the bool consistentFn, which might be false even though other calls said true.  (So in that case I think we'd need "key->recheckCurItem = recheck" at the end.)  I also wonder how any of that works correctly for real triconsistent functions, which don't have access to the recheckCurItem flag.

Anyway, I'm punting this to the code authors.  I'd like to see some comments clarifying what the API really is, not just a quick-n-dirty code fix.

			regards, tom lane
Re: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Alexander Korotkov
Date:
On Thu, Jun 17, 2021 at 10:49 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> After staring at the code I think there is at least one bug, and
> possibly two, in shimTriConsistentFn.
> [...]
> So in the spot where we do this:
>
>     /* First call consistent function with all the maybe-inputs set FALSE */
>     for (i = 0; i < nmaybe; i++)
>         key->entryRes[maybeEntries[i]] = GIN_FALSE;
>     curResult = directBoolConsistentFn(key);
>
> we need to add some code that checks for default searchMode, and in
> that case doesn't call the consistentFn unless at least one
> (non-MAYBE) input is TRUE.

+1, I think in default search mode we can just start with curResult equal to GIN_FALSE instead of calling directBoolConsistentFn().

> The other thing that I'm unsure about, because the code is horribly
> underdocumented, is that it's not very clear whether
> shimTriConsistentFn is accurately converting between the bool and
> the tristate APIs.
> [...]
> In particular, is the end state of
> key->recheckCurItem meaningful in the tristate case?  If it's not,
> then the short-circuit case for no MAYBE inputs is broken, because
> it'll return TRUE when the bool consistentFn is trying to tell us
> to recheck.

As far as I recall, returning MAYBE for no MAYBE inputs in the triConsistent function is equivalent to setting the recheck flag in the bool consistent function.  Thus, the short-circuit case for no MAYBE inputs should be broken.

> Anyway, I'm punting this to the code authors.  I'd like to see
> some comments clarifying what the API really is, not just a
> quick-n-dirty code fix.

Yep, I'm going to have a closer look at this tomorrow.

------
Regards,
Alexander Korotkov
Re: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Tom Lane
Date:
Alexander Korotkov <aekorotkov@gmail.com> writes:
> On Thu, Jun 17, 2021 at 10:49 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> we need to add some code that checks for default searchMode, and in
>> that case doesn't call the consistentFn unless at least one
>> (non-MAYBE) input is TRUE.

> I think in default search mode we can just start with curResult equal
> to GIN_FALSE instead of calling directBoolConsistentFn().

No, it's not that simple, because there might be other keys that are TRUE not MAYBE.  So the result could legitimately be TRUE in this case.

BTW, I think it'd be a really good idea for this function to restore all the MAYBE entries to GIN_MAYBE before returning, so that you can remove the caveat that it destroys the contents of entryRes[].  I have basically zero faith that that's safe, and it seems pretty cheap to not have to make the assumption.

			regards, tom lane
Re: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Alexander Korotkov
Date:
On Fri, Jun 18, 2021 at 1:04 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alexander Korotkov <aekorotkov@gmail.com> writes:
> > I think in default search mode we can just start with curResult equal
> > to GIN_FALSE instead of calling directBoolConsistentFn().
>
> No, it's not that simple, because there might be other keys that are
> TRUE not MAYBE.  So the result could legitimately be TRUE in this case.

Yes, that's it.  Thank you for the correction.

> BTW, I think it'd be a really good idea for this function to restore
> all the MAYBE entries to GIN_MAYBE before returning, so that you can
> remove the caveat that it destroys the contents of entryRes[].  I have
> basically zero faith that that's safe, and it seems pretty cheap to
> not have to make the assumption.

+1, sounds like a good idea.

------
Regards,
Alexander Korotkov
Re: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Alexander Korotkov
Date:
On Fri, Jun 18, 2021 at 12:55 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Thu, Jun 17, 2021 at 10:49 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Anyway, I'm punting this to the code authors.  I'd like to see
> > some comments clarifying what the API really is, not just a
> > quick-n-dirty code fix.
>
> Yep, I'm going to have a closer look at this tomorrow.

Sorry for the delay.  I'm going to take a closer look in the next couple of days.

------
Regards,
Alexander Korotkov
IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Heikki Linnakangas
Date:
On 17/06/2021 22:49, Tom Lane wrote:
> After staring at the code I think there is at least one bug, and
> possibly two, in shimTriConsistentFn.  That's likely to be implicated
> here because intarray's GIN opclass only provides a bool consistent
> function.
> [...]
> The definite bug is triggered because intarray relies on the API
> specification that says that if the search mode is
> GIN_SEARCH_MODE_DEFAULT, then the consistentFn will only be called
> when there's at least one TRUE key:
> [...]
> shimTriConsistentFn ignores this and calls it on all-FALSE inputs,
> for which it'll incorrectly get a TRUE result, as it will also for
> all the following checks.  The upshot is that shimTriConsistentFn
> will convert any case with a MAYBE input to a hard TRUE result,
> with no recheck requirement.  This'd easily explain the reported
> misbehavior.

That's subtle. The documentation says:

> If <literal>*searchMode</literal> is set to
> <literal>GIN_SEARCH_MODE_DEFAULT</literal> (which is the value it is
> initialized to before call), only items that match at least one of
> the returned keys are considered candidate matches.

I guess you can read that as "consistentFn will only be called when there is at least one matching item", but that didn't occur to me when I read that at first. But I agree we need to fix shimTriConsistentFn to respect that.

> The other thing that I'm unsure about, because the code is horribly
> underdocumented, is that it's not very clear whether
> shimTriConsistentFn is accurately converting between the bool and
> the tristate APIs.
> [...]
> Anyway, I'm punting this to the code authors.  I'd like to see
> some comments clarifying what the API really is, not just a
> quick-n-dirty code fix.

I've been trying to create a test case for this, but no luck so far. I cannot find a codepath that would hit these bugs with the kind of query that Pawel used. The search condition is very simple: "column && '{constant}'", with only one key and one constant to search for. There are three calls to triConsistentFn:

1. In startScanKey, but that's only reached if (key->nentries > 1), so not reachable with Pawel's query.

2. In keyGetItem, when haveLossyEntry==true. But AFAICS, haveLossyEntry is never true with Pawel's query. That would require curItem to be a lossy pointer, and that in turn can only happen when matchBitmap is used, and that's only used with partial match queries and with queries that need a full-index scan.

3. The second call in keyGetItem is reached, but it is only passed GIN_MAYBE when curItem is lossy. Which isn't possible with this query, see point 2.

AFAICS the recheckCurItem bug should also not cause wrong results with that query. I must be missing something.

I could probably construct a test case with a different query, one involving multiple elements in the search key, but I'd like to have a solid explanation for the original report.

Pawel, is this a production system, or would it be possible for you to build from sources with a patch with some extra debugging output?

- Heikki
Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Pawel Kudzia
Date:
> Pawel, is this a production system, or would it be possible for you to
> build from sources with a patch with some extra debugging output?

Thank you for looking into it!

It is a production system, but we have a couple of replicas. When the problem occurs it can be reproduced on any replica as well. When it happens again I can stop one of the replicas, take a file-level backup and set up a test machine on which I can try to build PG from source, including your patch adding extra verbosity.

Please note that the last inconsistency has been fixed, but I expect a new one to show up within a few days.

greetings!

--
regards,
Pawel Kudzia
Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Alexander Korotkov
Date:
On Thu, Jun 24, 2021 at 3:59 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 17/06/2021 22:49, Tom Lane wrote:
> > After staring at the code I think there is at least one bug, and
> > possibly two, in shimTriConsistentFn.
> > [...]
> [...]
> I've been trying to create a test case for this, but no luck so far. I
> cannot find a codepath that would hit these bugs with the kind of query
> that Pawel used. The search condition is very simple: "column &&
> '{constant}'", with only one key and one constant to search for.
> [...]
> AFAICS the recheckCurItem bug should also not cause wrong results with
> that query. I must be missing something.
>
> I could probably construct a test case with a different query, one
> involving multiple elements in the search key, but I'd like to have a
> solid explanation for the original report.

+1, I see the query provided by Pawel is too simple. The bugs Tom dug up in shimTriConsistentFn() don't seem to explain the wrong query results that Pawel reported.

> Pawel, is this a production system, or would it be possible for you to
> build from sources with a patch with some extra debugging output?

Do you think it's also worth checking whether the bug persists with fastupdate = off? Then we could localize whether the bug is related to pending lists.

------
Regards,
Alexander Korotkov
Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Pawel Kudzia
Date:
On Thu, Jun 24, 2021 at 3:32 PM Pawel Kudzia <kudzia@gmail.com> wrote:
> > Pawel, is this a production system, or would it be possible for you to
> > build from sources with a patch with some extra debugging output?
>
> Thank you for looking into it!
>
> It is a production system, but we have a couple of replicas. When the
> problem occurs it can be reproduced on any replica as well. When it
> happens again I can stop one of the replicas, take a file-level backup
> and set up a test machine on which I can try to build PG from source,
> including your patch adding extra verbosity.
>
> Please note that the last inconsistency has been fixed, but I expect a
> new one to show up within a few days.

Hello again,

I have a consistent file-level backup of PostgreSQL's /var/lib/postgresql + /etc/postgresql on which I can reproduce the issue reliably. It's on a test machine where we can put a patched version of PG. Currently this machine is running PostgreSQL 13.3 (Debian 13.3-1.pgdg100+1).

set enable_seqscan=off;
SELECT entity_id FROM entity WHERE ( attribute_name_ids && '{1737}' ) AND NOT ( (attribute_name_ids||0) && '{1737}') LIMIT 10;

returns 2 rows while it should return none. Also in this case, lowering work_mem from 1MB to 256kB fixes the issue - the SELECT returns 0 rows instead of 2.

I'll be happy to run a patched version and send you back the logs it produces.

greetings!

--
regards,
Pawel Kudzia
Re: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Alexander Korotkov
Date:
On Tue, Jun 22, 2021 at 7:02 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Fri, Jun 18, 2021 at 12:55 AM Alexander Korotkov
> <aekorotkov@gmail.com> wrote:
> > On Thu, Jun 17, 2021 at 10:49 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > > Anyway, I'm punting this to the code authors.  I'd like to see
> > > some comments clarifying what the API really is, not just a
> > > quick-n-dirty code fix.
> >
> > Yep, I'm going to have a closer look at this tomorrow.
>
> Sorry for the delay.  I'm going to take a closer look in the next
> couple of days.

I've had a closer look at the shimTriConsistentFn() function. It looks to me like the function itself has inconsistencies, but the way it's currently used in GIN shouldn't produce wrong query answers.

shimTriConsistentFn() is one implementation of GinScanKey.triConsistentFn. In turn, GinScanKey.triConsistentFn has 3 callers:
1) startScanKey() to determine required keys
2) keyGetItem() for the lossy item check
3) keyGetItem() for the normal item check

Let's look at shimTriConsistentFn()'s inconsistencies and how the callers handle them.

1) shimTriConsistentFn() returns the result of directBoolConsistentFn() for zero maybe entries without examining key->recheckCurItem. That may result in returning GIN_TRUE instead of GIN_MAYBE.
1.1) startScanKey() doesn't care about recheck, it's just looking for a GIN_FALSE result.
1.2) keyGetItem() for a lossy item always implies recheck.
1.3) keyGetItem() for a normal item does the trick. Whether the current item is rechecked is controlled by the key->recheckCurItem variable (the same one that is set by directBoolConsistentFn()). The following switch sets key->recheckCurItem for GIN_MAYBE, but leaves it untouched for GIN_TRUE. Thus, GIN_TRUE with key->recheckCurItem set works here just like GIN_MAYBE. I think this is inconsistent by itself, but this inconsistency compensates for the inconsistency in shimTriConsistentFn().

2) shimTriConsistentFn() might call directBoolConsistentFn() with all-false inputs for GIN_SEARCH_MODE_DEFAULT. The result is intended to be false, but the opclass consistent method isn't expected to handle this situation correctly. For instance, ginint4_consistent() returns true without checking the input array. That could make shimTriConsistentFn() return GIN_TRUE instead of GIN_MAYBE, or GIN_MAYBE instead of GIN_FALSE.
2.1) In principle, this could lead startScanKey() to select fewer required entries than possible. But this is definitely not the case for ginint4_consistent(), where meeting any one of the entries is enough for a match.
2.2) In principle, keyGetItem() could get false positives for lossy items. But that wouldn't lead to wrong query answers. Again, this is not the case for ginint4_consistent().
2.3) keyGetItem() for a normal item doesn't call triConsistentFn() with any GIN_MAYBE or all GIN_FALSE inputs.

To sum up, I don't see how the inconsistencies in shimTriConsistentFn() could lead to wrong query answers, especially in an intarray GIN index. Nevertheless, these inconsistencies should be fixed. I'm going to propose a patch soon.

------
Regards,
Alexander Korotkov
Re: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Heikki Linnakangas
Date:
On 11/07/2021 00:56, Alexander Korotkov wrote:
> I've had a closer look at the shimTriConsistentFn() function. It looks
> to me like the function itself has inconsistencies, but the way it's
> currently used in GIN shouldn't produce wrong query answers.
> [...]
> To sum up, I don't see how the inconsistencies in shimTriConsistentFn()
> could lead to wrong query answers, especially in an intarray GIN index.

Agreed, I came to the same conclusion looking at the code. Which means that we still don't have an explanation for the original bug report :-(.

> Nevertheless, these inconsistencies should be fixed. I'm going to
> propose a patch soon.

Thanks! I came up with the attached patch. I changed directBoolConsistentFn() to return a GinTernaryValue rather than bool, I think that makes the logic in shimTriConsistentFn() more clear.

I also tried to write a test case to expose issue 2.1 above, but couldn't come up with an example.

- Heikki
Attachment
Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Heikki Linnakangas
Date:
On 05/07/2021 16:36, Pawel Kudzia wrote:
> I have a consistent file-level backup of PostgreSQL's /var/lib/postgresql +
> /etc/postgresql on which I can reproduce the issue reliably. It's on a test
> machine where we can put a patched version of PG. Currently this machine
> is running PostgreSQL 13.3 (Debian 13.3-1.pgdg100+1).
>
> set enable_seqscan=off;
> SELECT entity_id FROM entity WHERE ( attribute_name_ids && '{1737}' )
> AND NOT ( (attribute_name_ids||0) && '{1737}') LIMIT 10;
>
> returns 2 rows while it should return none. Also in this case, lowering
> work_mem from 1MB to 256kB fixes the issue - the SELECT returns 0 rows
> instead of 2.
>
> I'll be happy to run a patched version and send you back the logs it produces.

Thanks! Here's a patch that prints lines with a "GINBUG:" prefix in the log. It should apply cleanly to PostgreSQL v13.

This should help to confirm whether the bugs in shimTriConsistentFn that Tom pointed out are the root cause of this issue. It should also tell us whether the rows are being returned from the pending-inserts list or the regular part of the GIN index.

We might need a few iterations to test different theories, but we'll get there...

- Heikki

Ps. Sorry for the delay, I didn't see your reply until now, it went into the spam folder.
Attachment
Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Pawel Kudzia
Date:
On Sun, Jul 11, 2021 at 8:42 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Thanks! Here's a patch that prints lines with a "GINBUG:" prefix in the
> log. It should apply cleanly to PostgreSQL v13.
>
> This should help to confirm whether the bugs in shimTriConsistentFn that
> Tom pointed out are the root cause of this issue. It should also tell us
> whether the rows are being returned from the pending-inserts list or the
> regular part of the GIN index.
>
> We might need a few iterations to test different theories, but we'll get
> there...

Heikki - thank you for the patch! It applied cleanly.

I've run these two commands:

SET enable_seqscan=off;
SELECT entity_id FROM entity WHERE ( attribute_name_ids && '{1737}' ) AND NOT ( (attribute_name_ids||0) && '{1737}') LIMIT 10;

and got this in the output:

2021-07-12 07:41:05 UTC LOG:  GINBUG: startScanKey called: excludeOnly 0 nentries 1
2021-07-12 07:41:05 UTC STATEMENT:  SELECT entity_id FROM entity WHERE ( attribute_name_ids && '{1737}' ) AND NOT ( (attribute_name_ids||0) && '{1737}') LIMIT 10;
2021-07-12 07:41:05 UTC LOG:  GINBUG: called triConsistentFn(1): 1 0

The "STATEMENT" line is repeated 79586 times, the "GINBUG: called triConsistentFn" line 79585 times. There's nothing else in the log besides information about server startup.

I'll be happy to help with the follow-up checks.

> Ps. Sorry for the delay, I didn't see your reply until now, it went into
> the spam folder

no worries!

--
regards,
Pawel Kudzia
Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Heikki Linnakangas
Date:
On 12/07/2021 20:24, Pawel Kudzia wrote:
> I've run these two commands:
>
> SET enable_seqscan=off;
> SELECT entity_id FROM entity WHERE ( attribute_name_ids && '{1737}' )
> AND NOT ( (attribute_name_ids||0) && '{1737}') LIMIT 10;
>
> and got this in the output:
>
> 2021-07-12 07:41:05 UTC LOG:  GINBUG: startScanKey called: excludeOnly 0 nentries 1
> 2021-07-12 07:41:05 UTC STATEMENT:  SELECT entity_id FROM entity WHERE
> ( attribute_name_ids && '{1737}' ) AND NOT ( (attribute_name_ids||0)
> && '{1737}') LIMIT 10;
> 2021-07-12 07:41:05 UTC LOG:  GINBUG: called triConsistentFn(1): 1 0
>
> The "STATEMENT" line is repeated 79586 times, the "GINBUG: called
> triConsistentFn" line 79585 times. There's nothing else in the log
> besides information about server startup.
>
> I'll be happy to help with the follow-up checks.

Ok, that confirms that it's taking the simple path through triConsistentFn(), with a single non-lossy key. Also, there don't seem to be any pending inserts. Now that we've ruled those out, I have no clue what the problem might be.

Is the data sensitive, or could you share it? I realize it's a pretty large database, but if I had a copy I could poke around to look at the index pages involved here, to see if they're corrupt and how. If that's possible, you can contact me directly, off-list, to figure out the logistics of sending over a copy of the data directory.

- Heikki
Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Peter Geoghegan
Date:
On Tue, Jul 13, 2021 at 3:25 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Is the data sensitive, or could you share it? I realize it's a
> pretty large database, but if I had a copy I could poke around to look
> at the index pages involved here, to see if they're corrupt and how. If
> that's possible, you can contact me directly, off-list, to figure out the
> logistics of sending over a copy of the data directory.

Worth noting that the Wiki describes two ways to do this when reporting a bug -- using gdb, and using contrib/pageinspect:

https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD#Dumping_a_page_image_from_within_GDB

--
Peter Geoghegan
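For the pageinspect route, a minimal sketch of inspecting the affected index (assuming the pageinspect extension is available, superuser access, and the default 8kB block size; the block number in the last query is illustrative):

CREATE EXTENSION IF NOT EXISTS pageinspect;

-- summarize the index's pages by their GIN opaque flags ({meta}, {leaf,data,compressed}, ...)
SELECT flags, count(*)
FROM generate_series(0, pg_relation_size('entity_attribute_name_ids_gin') / 8192 - 1) AS blkno,
     LATERAL gin_page_opaque_info(get_raw_page('entity_attribute_name_ids_gin', blkno::int))
GROUP BY flags
ORDER BY count(*) DESC;

-- dump the posting-list segments of one suspect posting-tree leaf page
SELECT * FROM gin_leafpage_items(get_raw_page('entity_attribute_name_ids_gin', 123));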
Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Pawel Kudzia
Date:
On Wed, Jul 14, 2021 at 1:03 AM Peter Geoghegan <pg@bowt.ie> wrote:
> Worth noting that the Wiki describes two ways to do this when
> reporting a bug -- using gdb, and using contrib/pageinspect:
>
> https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD#Dumping_a_page_image_from_within_GDB

Heikki - the data is semi-sensitive. I cannot share the whole table. I could if it were possible to remove some columns or isolate only a handful of rows.

But - just doing "vacuum full entity" resolves the original issue and the select returns 0 rows, as it should.

How can I identify the relevant pages to make a selective dump of them for you, using the method proposed by Peter?

Would an interactive session, where you get temporary access to a server with a copy of the database to run gdb on, be of any help?

Greetings!

--
regards,
Pawel Kudzia
Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Heikki Linnakangas
Date:
On 14/07/2021 17:37, Pawel Kudzia wrote:
> Heikki - the data is semi-sensitive. I cannot share the whole table.
> I could if it were possible to remove some columns or isolate only a
> handful of rows.
>
> But - just doing "vacuum full entity" resolves the original issue
> and the select returns 0 rows, as it should.

Ok. One idea would be to write a program or extension that overwrites all the uninteresting columns with zeros.

> How can I identify the relevant pages to make a selective dump of them
> for you, using the method proposed by Peter?

It's a bit hard without knowing which index pages are affected. But the attached script (dump-gin-page-info.sql) dumps some summary information about the kinds of pages there are. That's a start.

> Would an interactive session, where you get temporary access to a
> server with a copy of the database to run gdb on, be of any help?

Let's try one thing first: I rebased a patch to add GIN support to the amcheck extension [1]. I kind of doubt that it will catch the corruption in this case, but you never know. So please apply the attached v2-0001-Amcheck-for-GIN-13stable.patch patch. It is the same patch I posted at [1], but the installation script is slightly different to make it work on PostgreSQL v13. Then compile and install it:

cd contrib/amcheck; make install

Then in psql:

CREATE EXTENSION amcheck;
SELECT gin_index_parent_check('entity_attribute_name_ids_gin');

If it reports any inconsistencies, we can try to use pageinspect to gather further evidence on them.

[1] https://www.postgresql.org/message-id/a5bc2aad-464b-91bd-061d-28af0f9b634c%40iki.fi

- Heikki
Attachment
Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Pawel Kudzia
Date:
> > How can I identify the relevant pages to make a selective dump of them
> > for you, using the method proposed by Peter?
>
> It's a bit hard without knowing which index pages are affected. But the
> attached script (dump-gin-page-info.sql) dumps some summary information
> about the kinds of pages there are. That's a start.

I'm attaching the output of the proposed commands.

> Let's try one thing first: I rebased a patch to add GIN support to the
> amcheck extension [1]. I kind of doubt that it will catch the corruption
> in this case, but you never know. So please apply the attached
> v2-0001-Amcheck-for-GIN-13stable.patch patch. It is the same patch I
> posted at [1], but the installation script is slightly different to make
> it work on PostgreSQL v13. Then compile and install it:
>
> cd contrib/amcheck; make install
>
> Then in psql:
>
> CREATE EXTENSION amcheck;
> SELECT gin_index_parent_check('entity_attribute_name_ids_gin');

Sadly I'm getting this when executing CREATE EXTENSION amcheck;

ERROR:  extension "amcheck" has no installation script nor update path for version "1.2.1"

Thank you!

--
regards,
Pawel Kudzia
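One thing that may help diagnose that error is listing the amcheck versions the server actually has installation scripts for; a small query sketch (pg_available_extension_versions is a standard catalog view; the comment about the control file is an assumption about how the patched extension is packaged):

SELECT name, version, installed
FROM pg_available_extension_versions
WHERE name = 'amcheck'
ORDER BY version;

-- the default_version named in amcheck.control must have a matching
-- amcheck--<version>.sql script (or a chain of amcheck--<from>--<to>.sql
-- update scripts) in the server's extension directory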
Attachment
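That error usually means the new amcheck--1.2--1.2.1.sql script did not land in the server's extension directory. As a quick read-only diagnostic, the versions the server can actually see are listed in a catalog view (a sketch; nothing amcheck-specific is assumed beyond the extension name):
SELECT name, version, installed
FROM pg_available_extension_versions
WHERE name = 'amcheck';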
Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Alexander Korotkov
Date:
On Fri, Jun 25, 2021 at 11:48 PM Alexander Korotkov <aekorotkov@gmail.com> wrote: > Do you think it also worth checking whether bug persists when set > fastupdate = off. Then we could localize whether bug is related to > pending lists. I still think this is worth checking. Despite the pending list wasn't involved in the index scan with wrong results, the bug could be related to insertion to the pending list. Or it could be related to moving entries from the pending list to the posting list/tree. ------ Regards, Alexander Korotkov
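For anyone able to retry this on a reproducible setup, fastupdate is an ordinary index storage parameter, so the experiment Alexander suggests would look roughly like the sketch below. Note that fastupdate = off only affects future insertions; gin_clean_pending_list() (in core since 9.6) moves whatever already sits in the pending list into the main index structure, and a plain VACUUM of the table has the same effect.
ALTER INDEX entity_attribute_name_ids_gin SET (fastupdate = off);
-- flush entries already accumulated in the pending list
SELECT gin_clean_pending_list('entity_attribute_name_ids_gin');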
Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Pawel Kudzia
Date:
> On Fri, Jun 25, 2021 at 11:48 PM Alexander Korotkov > <aekorotkov@gmail.com> wrote: > > Do you think it also worth checking whether bug persists when set > > fastupdate = off. Then we could localize whether bug is related to > > pending lists. > > I still think this is worth checking. Despite the pending list wasn't > involved in the index scan with wrong results, the bug could be > related to insertion to the pending list. Or it could be related to > moving entries from the pending list to the posting list/tree. > regarding fastupdate - i'd like to clarify. do i understand correctly that this parameter affects what happens when rows are inserted/updated/deleted? if yes - we no longer have such workload on the production, we cannot trigger the issue anymore, and i was never able to find a reproducible scenario. the production environment was moved away from index built "USING gin (attribute_name_ids public.gin__int_ops)" to index "USING gin (attribute_name_ids)". the only thing i have right now is two file-level backups of the postgresql data directory on which i know that SELECTs return rows that actually don't meet provided criteria. is there any point in testing fastupdate = off on those two snapshots? -- regards, Pawel Kudzia
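Whether those two snapshots even contain a pending list can be checked read-only with pageinspect, without changing fastupdate at all; a sketch, assuming the extension is installed on the restored copy:
SELECT n_pending_pages, n_pending_tuples,
       n_total_pages, n_entry_pages, n_data_pages, n_entries
FROM gin_metapage_info(
       get_raw_page('entity_attribute_name_ids_gin', 0));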
Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Alexander Korotkov
Date:
On Thu, Jul 15, 2021 at 4:27 PM Pawel Kudzia <kudzia@gmail.com> wrote: > > On Fri, Jun 25, 2021 at 11:48 PM Alexander Korotkov > > <aekorotkov@gmail.com> wrote: > > > Do you think it also worth checking whether bug persists when set > > > fastupdate = off. Then we could localize whether bug is related to > > > pending lists. > > > > I still think this is worth checking. Despite the pending list wasn't > > involved in the index scan with wrong results, the bug could be > > related to insertion to the pending list. Or it could be related to > > moving entries from the pending list to the posting list/tree. > > > > regarding fastupdate - i'd like to clarify. do i understand correctly > that this parameter affects what happens when rows are > inserted/updated/deleted? Yes, that's it. > if yes - we no longer have such workload on the production and we > cannot anymore and i was never able to find reproducible scenario. > the production environment was moved away from index built "USING gin > (attribute_name_ids public.gin__int_ops)" to index "USING gin > (attribute_name_ids)". Do I understand correctly that after the production environment moved away from the usage of contrib/intarray opclass to builtin opclass, the error has gone? > the only thing i have right two file-level backups of postgresql data > directory on which i know that SELECTs return rows that actually don't > meet provided criteria. > > is there any point in testing fastupdate = off on those two snapshots? OK, I see. No point in trying fastupdate = off unless we can try to reproduce the index corruption. ------ Regards, Alexander Korotkov
Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Pawel Kudzia
Date:
> On Thu, Jul 15, 2021 at 4:27 PM Pawel Kudzia <kudzia@gmail.com> wrote:
> > On Fri, Jun 25, 2021 at 11:48 PM Alexander Korotkov
> > <aekorotkov@gmail.com> wrote:
> > > Do you think it also worth checking whether bug persists when set
> > > fastupdate = off. Then we could localize whether bug is related to
> > > pending lists.
> >
> > I still think this is worth checking. Despite the pending list wasn't
> > involved in the index scan with wrong results, the bug could be
> > related to insertion to the pending list. Or it could be related to
> > moving entries from the pending list to the posting list/tree.
> >
>
> regarding fastupdate - i'd like to clarify. do i understand correctly
> that this parameter affects what happens when rows are
> inserted/updated/deleted?
> Yes, that's it.
> if yes - we no longer have such workload on the production and we
> cannot anymore and i was never able to find reproducible scenario.
> the production environment was moved away from index built "USING gin
> (attribute_name_ids public.gin__int_ops)" to index "USING gin
> (attribute_name_ids)".
> Do I understand correctly that after the production environment moved
> away from the usage of contrib/intarray opclass to builtin opclass,
> the error has gone?
it's a bit too early to be confident - i'd give it at least 2-3 more weeks before stating that the problem was solved.
in the past, when using gin__int_ops, we had periods of a few consecutive days when we did not catch any inconsistent responses.
> the only thing i have right two file-level backups of postgresql data
> directory on which i know that SELECTs return rows that actually don't
> meet provided criteria.
>
> is there any point in testing fastupdate = off on those two snapshots?
> OK, I see. No point in trying fastupdate = off unless we can try to
> reproduce the index corruption.
thx for the clarification!
i'm happy to help with running more diagnostics on the database dumps that i've saved, where very specific SELECTs return rows that don't match provided criteria.
greetings!
regards,
Pawel Kudzia
Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Heikki Linnakangas
Date:
On 15/07/2021 12:38, Pawel Kudzia wrote: >>> How can I identify relevant pages to make selective dump of them for >>> you, using method proposed by Peter? >> >> It's a bit hard without knowing which index pages are affected. But the >> attached script (dump-gin-page-info.sql) dumps some summary information >> about the kind of pages there are. That's a start. > > I'm attaching output of the proposed commands. Thanks! >> Let's try one thing first: I rebased a patch to add GIN support to the >> amcheck extension [1]. I kind of doubt that it will catch the corruption >> in this case, but you never know. So please apply the attached >> v2-0001-Amcheck-for-GIN-13stable.patch patch. It is the same patch I >> posted at [1], but the installation script is slightly different to make >> it work on PostgreSQL v13. Then compile install it: >> >> cd contrib/amcheck; make install >> >> Then in psql: >> >> CREATE EXTENSION amcheck; >> gin_index_parent_check('entity_attribute_name_ids_gin'); > > Sadly i'm getting this when executing CREATE EXTENSION amcheck; > > ERROR: extension "amcheck" has no installation script nor update path > for version "1.2.1" Hmm, that's odd, it worked for me. Did "make install" copy the "amcheck--1.2--1.2.1.sql" file to the right place? It should be installed the same directory as "amcheck.control". Anyway, let's try a different tack. Here's another debugging patch. It causes a LOG message to be printed whenever a GIN index page is accessed, like this: 2021-07-15 21:22:20.119 EEST [972708] LOG: ReadBuffer 1663/12678/27035 read gin blk 0 (ginget.c:1832 scanPendingInsert) 2021-07-15 21:22:20.119 EEST [972708] STATEMENT: explain analyze select * from test_intarray where a @> '{2}'; 2021-07-15 21:22:20.119 EEST [972708] LOG: ReadBuffer 1663/12678/27035 read gin blk 1 (ginbtree.c:89 ginFindLeafPage) 2021-07-15 21:22:20.119 EEST [972708] STATEMENT: explain analyze select * from test_intarray where a @> '{2}'; Please apply this patch, run the query again, and copy the resulting log. I'm also interested in the contents of the pages that are accessed, so if you could then run these commands to dump the contents of those index pages: # extract just the block numbers from the log cat <path to postgres log> | perl -ne '/.*read gin blk (\d+)/ && print "$1\n" ' | sort -n |uniq > /tmp/blocknums # dump contents of each of those index pages to '/tmp/block-contents' psql postgres -c "create temp table blocknums(blknum int); copy blocknums from '/tmp/blocknums'; copy (select blknum, get_raw_page('entity_attribute_name_ids_gin', blknum) from blocknums) to '/tmp/block-contents';" Please send over the resulting PostgreSQL log, and '/tmp/block-contents', and I'll take a closer look into the contents to see if there's something funny. (A private email is OK if you don't feel comfortable sharing it with the world). One more thing, please also add the 'ctid' virtual column to the query, like this: SELECT ctid, entity_id FROM entity WHERE ( attribute_name_ids && '{1737}' ) AND NOT ( (attribute_name_ids||0) && '{1737}') LIMIT 10; Also, did you happen to keep the WAL log from around the time that the query started producing wrong results? That would also be useful evidence for how it got into this situation. Thanks in advance! We might need a few more round-trips, but we'll get there. - Heikki
Attachment
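On the receiving side, a file produced by that COPY ... TO can be loaded back into a scratch database and decoded with pageinspect, roughly as below. (The ginblocks table that shows up later in the thread appears to have been built this way, but that is an assumption.)
CREATE TABLE ginblocks (blknum int, content bytea);
COPY ginblocks FROM '/tmp/block-contents';
-- decode the GIN special area of every dumped page
SELECT blknum, (gin_page_opaque_info(content)).*
FROM ginblocks
ORDER BY blknum;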
Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Pawel Kudzia
Date:
On Thu, Jul 15, 2021 at 8:49 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 15/07/2021 12:38, Pawel Kudzia wrote:
>>> How can I identify relevant pages to make selective dump of them for
>>> you, using method proposed by Peter?
>>
>> It's a bit hard without knowing which index pages are affected. But the
>> attached script (dump-gin-page-info.sql) dumps some summary information
>> about the kind of pages there are. That's a start.
>
> I'm attaching output of the proposed commands.
> Thanks!
>> Let's try one thing first: I rebased a patch to add GIN support to the
>> amcheck extension [1]. I kind of doubt that it will catch the corruption
>> in this case, but you never know. So please apply the attached
>> v2-0001-Amcheck-for-GIN-13stable.patch patch. It is the same patch I
>> posted at [1], but the installation script is slightly different to make
>> it work on PostgreSQL v13. Then compile install it:
>>
>> cd contrib/amcheck; make install
>>
>> Then in psql:
>>
>> CREATE EXTENSION amcheck;
>> gin_index_parent_check('entity_attribute_name_ids_gin');
>
> Sadly i'm getting this when executing CREATE EXTENSION amcheck;
>
> ERROR: extension "amcheck" has no installation script nor update path
> for version "1.2.1"
> Hmm, that's odd, it worked for me. Did "make install" copy the
> "amcheck--1.2--1.2.1.sql" file to the right place? It should be
> installed the same directory as "amcheck.control".
thank you for the hint. i've messed up patching. now i can get one step further:
data=# CREATE EXTENSION amcheck;
CREATE EXTENSION
data=# gin_index_parent_check('entity_attribute_name_ids_gin');
ERROR: syntax error at or near "gin_index_parent_check"
LINE 1: gin_index_parent_check('entity_attribute_name_ids_gin');
^
i've also applied trace-gin-readbuffer-2.patch
> Anyway, let's try a different tack. Here's another debugging patch. It
> causes a LOG message to be printed whenever a GIN index page is
> accessed, like this:
> 2021-07-15 21:22:20.119 EEST [972708] LOG: ReadBuffer 1663/12678/27035
> read gin blk 0 (ginget.c:1832 scanPendingInsert)
> 2021-07-15 21:22:20.119 EEST [972708] STATEMENT: explain analyze select
> * from test_intarray where a @> '{2}';
> 2021-07-15 21:22:20.119 EEST [972708] LOG: ReadBuffer 1663/12678/27035
> read gin blk 1 (ginbtree.c:89 ginFindLeafPage)
> 2021-07-15 21:22:20.119 EEST [972708] STATEMENT: explain analyze select
> * from test_intarray where a @> '{2}';
> Please apply this patch, run the query again, and copy the resulting
> log.
that'll be the file called "log", sent directly to your e-mail address.
> I'm also interested in the contents of the pages that are accessed,
> so if you could then run these commands to dump the contents of those
> index pages:
> # extract just the block numbers from the log
> cat <path to postgres log> | perl -ne '/.*read gin blk (\d+)/ && print
> "$1\n" ' | sort -n |uniq > /tmp/blocknums
> # dump contents of each of those index pages to '/tmp/block-contents'
> psql postgres -c "create temp table blocknums(blknum int); copy
> blocknums from '/tmp/blocknums'; copy (select blknum,
> get_raw_page('entity_attribute_name_ids_gin', blknum) from blocknums) to
> '/tmp/block-contents';"
> Please send over the resulting PostgreSQL log, and
> '/tmp/block-contents', and I'll take a closer look into the contents to
> see if there's something funny. (A private email is OK if you don't feel
> comfortable sharing it with the world).
> One more thing, please also add the 'ctid' virtual column to the query,
> like this:
> SELECT ctid, entity_id FROM entity WHERE
> ( attribute_name_ids && '{1737}' ) AND NOT ( (attribute_name_ids||0)
> && '{1737}') LIMIT 10;
that's how the queries above were run. output:
ctid | entity_id
--------------+-----------
(4002784,1) | 38048120
(4002869,14) | 95333744
(2 rows)
> Also, did you happen to keep the WAL log from around the time that the
> query started producing wrong results? That would also be useful
> evidence for how it got into this situation.
sadly i don't. as noted earlier - we've already switched the production setup, so i have no way of triggering the corruption again; i only have two file-level backups of the database with two different inconsistencies in the SELECTs.
greetings!
regards,
Pawel Kudzia
Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Peter Geoghegan
Date:
On Thu, Jul 15, 2021 at 3:56 AM Alexander Korotkov <aekorotkov@gmail.com> wrote: > I still think this is worth checking. Despite the pending list wasn't > involved in the index scan with wrong results, the bug could be > related to insertion to the pending list. Or it could be related to > moving entries from the pending list to the posting list/tree. If I had to guess I'd say that the chances of the pending list being involved are high. shiftList() deletes pages in the pending list -- this is called from ginInsertCleanup(). But shiftList() doesn't call GinPageSetDeleteXid() to set an XID that represents when the page is safe to recycle, which is what ginDeletePage() always does. Why is that okay? Note that we usually use read/share buffer locks when lock-coupling inside ginInsertCleanup() -- so AFAICT we won't block-out concurrent readers with a link that's about to go stale due to recycling of the page. This looks unsafe. Of course it's very hard to tell what might be going on, since none of this seems to be explained anywhere. Even ginDeletePage() didn't do the right thing with XIDs until bugfix commit 52ac6cd2d0. That commit didn't touch any pending list code -- it should have at least explained why ginInsertCleanup()/shiftList() don't need to use the GinPageSetDeleteXid() stuff. -- Peter Geoghegan
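The recycling detail Peter describes is at least observable from SQL on a copy of the index: page_header() from pageinspect exposes pd_prune_xid, and gin_page_opaque_info() exposes the deleted flag, so deleted pages that never received a recycle XID would show prune_xid = 0 here. A sketch, assuming every block parses as a GIN page and that 'deleted' is the flag label pageinspect uses:
SELECT blkno, h.prune_xid, o.flags
FROM generate_series(1,
       (pg_relation_size('entity_attribute_name_ids_gin')
        / current_setting('block_size')::int)::int - 1) AS blkno,
     LATERAL get_raw_page('entity_attribute_name_ids_gin', blkno) AS raw(page),
     LATERAL page_header(raw.page) AS h,
     LATERAL gin_page_opaque_info(raw.page) AS o
WHERE 'deleted' = ANY (o.flags);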
Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Heikki Linnakangas
Date:
On 17/07/2021 02:40, Peter Geoghegan wrote: > On Thu, Jul 15, 2021 at 3:56 AM Alexander Korotkov <aekorotkov@gmail.com> wrote: >> I still think this is worth checking. Despite the pending list wasn't >> involved in the index scan with wrong results, the bug could be >> related to insertion to the pending list. Or it could be related to >> moving entries from the pending list to the posting list/tree. > > If I had to guess I'd say that the chances of the pending list being > involved are high. > > shiftList() deletes pages in the pending list -- this is called from > ginInsertCleanup(). But shiftList() doesn't call GinPageSetDeleteXid() > to set an XID that represents when the page is safe to recycle, which > is what ginDeletePage() always does. Why is that okay? > > Note that we usually use read/share buffer locks when lock-coupling > inside ginInsertCleanup() -- so AFAICT we won't block-out concurrent > readers with a link that's about to go stale due to recycling of the > page. This looks unsafe. Of course it's very hard to tell what might > be going on, since none of this seems to be explained anywhere. > > Even ginDeletePage() didn't do the right thing with XIDs until bugfix > commit 52ac6cd2d0. That commit didn't touch any pending list code -- > it should of at least explained why ginInsertCleanup()/shiftList() > don't need to use the GinPageSetDeleteXid() stuff. Hmm, seems we should fix that. But could a prematurely recycled deleted page cause permanent corruption? Here's what I've found out so far, looking at the traces Pawel sent: > 2021-07-16 07:01:19 UTC LOG: ReadBuffer 1663/16390/16526 read gin blk 0 (ginutil.c:636 ginGetStats) > 2021-07-16 07:01:19 UTC LOG: ReadBuffer 1663/16390/16526 read gin blk 0 (ginget.c:1866 scanPendingInsert) > 2021-07-16 07:01:19 UTC LOG: ReadBuffer 1663/16390/16526 read gin blk 383295 (ginget.c:1894 scanPendingInsert) > 2021-07-16 07:01:19 UTC LOG: ReadBuffer 1663/16390/16526 read gin blk 383298 (ginget.c:1518 scanGetCandidate) > 2021-07-16 07:01:19 UTC LOG: ReadBuffer 1663/16390/16526 read gin blk 2 (ginget.c:1518 scanGetCandidate) > 2021-07-16 07:01:19 UTC LOG: ReadBuffer 1663/16390/16526 read gin blk 3 (ginget.c:1518 scanGetCandidate) > [few hundred calls from scanGetCandidate omitted] > 2021-07-16 07:01:19 UTC LOG: ReadBuffer 1663/16390/16526 read gin blk 200586 (ginget.c:1518 scanGetCandidate) > 2021-07-16 07:01:19 UTC LOG: ReadBuffer 1663/16390/16526 read gin blk 200854 (ginget.c:1518 scanGetCandidate) > 2021-07-16 07:01:19 UTC LOG: ReadBuffer 1663/16390/16526 read gin blk 1 (ginbtree.c:89 ginFindLeafPage) > 2021-07-16 07:01:19 UTC LOG: ReadBuffer 1663/16390/16526 read gin blk 1452 (ginbtree.c:89 ginFindLeafPage) > 2021-07-16 07:01:19 UTC LOG: GINBUG: startScanKey called: excludeOnly 0 nentries 1 > 2021-07-16 07:01:19 UTC LOG: GINBUG: called triConsistentFn(1): 1 0 > 2021-07-16 07:01:19 UTC LOG: GINBUG: called triConsistentFn(1): 1 0 > 2021-07-16 07:01:19 UTC LOG: GINBUG: called triConsistentFn(1): 1 0 > ... So, it begins by reading the metadata page, and then it scans the pending list. There are a few hundred pages in the pending list, but no matches are found there. Then, it reads the root page at blk 1. And then, it reads page 1452. 
I used pg_filedump on the root page (blk 1): > Block 1 ******************************************************** > <Header> ----- > Block Offset: 0x00002000 Offsets: Lower 1376 (0x0560) > Block: Size 8192 Version 4 Upper 2776 (0x0ad8) > LSN: logid 410 recoff 0x04c6b6c0 Special 8184 (0x1ff8) > Items: 338 Free Space: 1400 > Checksum: 0xde29 Prune XID: 0x00000000 Flags: 0x0000 () > Length (including item array): 1376 > > <Data> ----- > Item 1 -- Length: 16 Offset: 7736 (0x1e38) Flags: NORMAL > Item 2 -- Length: 16 Offset: 8168 (0x1fe8) Flags: NORMAL > Item 3 -- Length: 16 Offset: 7800 (0x1e78) Flags: NORMAL > ... > Item 337 -- Length: 16 Offset: 7032 (0x1b78) Flags: NORMAL > Item 338 -- Length: 16 Offset: 8152 (0x1fd8) Flags: NORMAL > > <Special Section> ----- > GIN Index Section: > Flags: 0x00000000 () Maxoff: 0 > Blocks: RightLink (-1) It has no Flags set, which means that it is an internal entry tree page. The items on an internal entry tree page should point to other entry tree pages. Now let's take a look at the next page that it read, blk 1452: > Block 1452 ******************************************************** > <Header> ----- > Block Offset: 0x00b58000 Offsets: Lower 562 (0x0232) > Block: Size 8192 Version 4 Upper 8184 (0x1ff8) > LSN: logid 471 recoff 0xf0cf8a68 Special 8184 (0x1ff8) > Items: 134 Free Space: 7622 > Checksum: 0x1628 Prune XID: 0x00000000 Flags: 0x0000 () > Length (including item array): 560 > > <Data> ----- > Item 1 -- Length: 0 Offset: 0 (0x0000) Flags: UNUSED > Item 2 -- Length: 0 Offset: 0 (0x0000) Flags: UNUSED > Item 3 -- Length: 7780 Offset: 0 (0x0000) Flags: UNUSED > Item 4 -- Length: 5990 Offset: 4 (0x0004) Flags: UNUSED > Item 5 -- Length: 0 Offset: 6 (0x0006) Flags: UNUSED > Item 6 -- Length: 2 Offset: 15559 (0x3cc7) Flags: UNUSED > Error: Item contents extend beyond block. > BlockSize<8192> Bytes Read<8192> Item Start<15561>. > Item 7 -- Length: 3 Offset: 15372 (0x3c0c) Flags: REDIRECT > ... > Item 133 -- Length: 20913 Offset: 3 (0x0003) Flags: UNUSED > Error: Item contents extend beyond block. > BlockSize<8192> Bytes Read<8192> Item Start<20916>. > Item 134 -- Length: 0 Offset: 0 (0x0000) Flags: UNUSED > > <Special Section> ----- > GIN Index Section: > Flags: 0x00000001 (DATA) Maxoff: 53 > Blocks: RightLink (-1) On this page, the DATA flag is set, so it is an internal *posting* tree page. That's weird: the scan walked straight from an internal entry tree page (root, at blk 1) into an internal posting tree page (blk 1452). That doesn't make sense to me. The next ReadBuffer call is this: > 2021-07-16 07:01:19 UTC LOG: ReadBuffer 1663/16390/16526 read gin blk 15559 (ginbtree.c:183 ginStepRight) Where did block 15559 come from? How come we're stepping right to it? It's not the right sibling of the previously accessed page, 1452. In fact, 15559 is a leaf posting tree page. I don't understand how that sequence of page reads could happen. After that, the log shows that it follows the right-links from block 15559 onward. The incorrectly returned entries are on posting tree page 255179. The scan reaches that page by following that chain of right-links. I'm going to continue investigating this on Monday. I'll take a closer look at the index entries on those blocks to see if I can make sense of it. - Heikki
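The same page-category check can be done with pageinspect on a copy of the database; gin_page_opaque_info() reports the special-area flags that pg_filedump prints, i.e. an empty flag set for an internal entry-tree page versus {data} for an internal posting-tree page. A sketch for the two blocks discussed above:
SELECT 1 AS blkno, *
FROM gin_page_opaque_info(get_raw_page('entity_attribute_name_ids_gin', 1))
UNION ALL
SELECT 1452, *
FROM gin_page_opaque_info(get_raw_page('entity_attribute_name_ids_gin', 1452));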
Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Peter Geoghegan
Date:
On Fri, Jul 16, 2021 at 5:30 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > Hmm, seems we should fix that. But could a prematurely recycled deleted > page cause permanent corruption? If scans can find a page that is wholly unrelated to the expected page (and possibly even in the wrong high level page category), then it's really hard to predict what might happen. This could lead to real chaos. ginInsertCleanup() makes no attempt to perform basic validation of its assumptions about what kind of page this is, except for some assertions. We should have something like a "can't happen" error on !GinPageIsList() inside ginInsertCleanup() -- if we had that already then I might be able to reason about this problem. It wouldn't hurt to have similar checks in other code that deals with posting trees and entry trees, too. ginInsertCleanup() is tolerant of all kinds of things. It's not just the lack of page-level sanity checks. It's also the basic approach to crash safety, which relies on the fact that GIN only does lossy index scans. My guess is that there could be lots of problems without it being obvious to users. Things really went downhill in ginInsertCleanup() starting in commit e956808328. > On this page, the DATA flag is set, so it is an internal *posting* tree > page. > > That's weird: the scan walked straight from an internal entry tree page > (root, at blk 1) into an internal posting tree page (blk 1452). That > doesn't make sense to me. I agree that the internal entry tree page (root, at blk 1) looks sane, from what I've seen. The tuple sizes are plausible -- 16 byte index tuples aren't possible on an entry tree leaf page. Nor in a pending list page. Anyway, this is roughly the kind of bug I had in mind. It's possible that the underlying problem doesn't actually involve ginInsertCleanup() -- as I said we have seen similar issues elsewhere (one such issue was fixed in commit 52ac6cd2d0). But as Alexander pointed out, that doesn't mean much. It's possible that this problem is 1 or 2 problems removed from the original problem, which really did start in ginInsertCleanup() -- who knows? Why shouldn't corruption lead to more corruption, given that we don't do much basic page level validation? We do at least sanitize within ginStepRight(), but we need to be more consistent about it. > The next ReadBuffer call is this: > > > 2021-07-16 07:01:19 UTC LOG: ReadBuffer 1663/16390/16526 read gin blk 15559 (ginbtree.c:183 ginStepRight) > > Where did block 15559 come from? How come we're stepping right to it? > It's not the right sibling of the previously accessed page, 1452. In > fact, 15559 is a leaf posting tree page. I don't understand how that > sequence of page reads could happen. Maybe take a look at Block 1452 using pg_hexedit? pg_hexedit is designed to do well at interpreting quasi-corrupt data (or at least allowing the user to do so). We see from your pg_filedump output that the tuple contents for the page are totally wild. We should not trust the reported right sibling page, given everything else -- is that really what Postgres thinks the right sibling is? I mean, clearly it doesn't. I think it's possible that pg_filedump is interpreting it in a way that is kind of wrong. If you saw the same page (1452) in pg_hexedit you might spot a pattern that pg_filedump output will never reveal. At least looking at the raw bytes might give you some idea. -- Peter Geoghegan
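Short of pg_hexedit, the raw page header of block 1452 can at least be pulled with pageinspect, to check whether pd_lower, pd_upper and pd_special are self-consistent before trusting any interpretation of the item data; just a sketch:
SELECT *
FROM page_header(get_raw_page('entity_attribute_name_ids_gin', 1452));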
Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Heikki Linnakangas
Date:
On 16/07/2021 10:06, Pawel Kudzia wrote: > > > On Thu, Jul 15, 2021 at 8:49 PM Heikki Linnakangas <hlinnaka@iki.fi > <mailto:hlinnaka@iki.fi>> wrote: > > thank you for the hint. i've messed up patching. now i can get one step > further: > > data=# CREATE EXTENSION amcheck; > CREATE EXTENSION > data=# gin_index_parent_check('entity_attribute_name_ids_gin'); > ERROR: syntax error at or near "gin_index_parent_check" > LINE 1: gin_index_parent_check('entity_attribute_name_ids_gin'); > ^ Ah, sorry, you need to call it with SELECT, like: SELECT gin_index_parent_check('entity_attribute_name_ids_gin'); > i've also applied trace-gin-readbuffer-2.patch Darn, I missed one function that's used to read a page when descending the GIN tree. That explains the seemingly nonsensical accesses in the log - the trace left out some crucial accesses. Attached is a new version of that debugging patch. Please repeat the same steps as before with this: 1. Apply the patch (remove previous patch first if necessary) 2. Compile Postgres, "make install" 3. Clear the log, start postgres 4. Run the query again: SELECT ctid, entity_id FROM entity WHERE ( attribute_name_ids && '{1737}' ) AND NOT ( (attribute_name_ids||0) && '{1737}') LIMIT 10; 5. Stop the server. 6. Extract the content of the accessed index blocks: cat <path to postgres log> | perl -ne '/.*read gin blk (\d+)/ && print "$1\n" ' | sort -n |uniq > /tmp/blocknums psql data -c "create temp table blocknums(blknum int); copy blocknums from '/tmp/blocknums'; copy (select blknum, get_raw_page('entity_attribute_name_ids_gin', blknum) from blocknums) to '/tmp/block-contents';" 7. Send over /tmp/blocknums, /tmp/block-contents and the postgres log again. Thank you for your patience! - Heikki
Attachment
Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Heikki Linnakangas
Date:
On 17/07/2021 04:56, Peter Geoghegan wrote: > On Fri, Jul 16, 2021 at 5:30 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: >> Hmm, seems we should fix that. But could a prematurely recycled deleted >> page cause permanent corruption? > > If scans can find a page that is wholly unrelated to the expected page > (and possibly even in the wrong high level page category), then it's > really hard to predict what might happen. This could lead to real > chaos. Right. Looking at the code, I don't think that can happen. The code that deals with the pending list uses lock coupling, so it should never follow a pointer to a page that was concurrently deleted. Might something funny happen if a pending list page is deleted and immediately reused as another kind of page? I don't see a problem with that. And a gin btree page won't be reused prematurely as a pending list page, because ginNewBuffer() checks GinPageIsRecyclable(), even if the page is being reused as a pending list page. > ginInsertCleanup() makes no attempt to perform basic validation > of its assumptions about what kind of page this is, except for some > assertions. We should have something like a "can't happen" error on > !GinPageIsList() inside ginInsertCleanup() -- if we had that already > then I might be able to reason about this problem. It wouldn't hurt to > have similar checks in other code that deals with posting trees and > entry trees, too. +1 While playing with the trace Pawel sent, I loaded some of the index pages into a local instance so that I could step through the code with gdb. I filled the rest of the pages with zeros. I managed to get the gin btree descend code into an infinite loop with that. When it followed a pointer to an entry leaf page, but that entry leaf page was substituted with zeros, it interpreted the all-zeros page as a page with a right-link to block 0. It then stepped right to read block 0 - which is the metapage! - and incorrectly interpreted it as an internal page. Amazingly, it managed to perform the binary search on it as if it contained index tuples, and came up wth an another block number, which I had also replaced with all-zeros. Long story short, we should sprinkle a lot more sanity checks to the functions in ginbtree.c. It should check that the page is of expected kind; no stepping from internal entry page to posting tree page. And seeing a pointer to block 0 at any step should be an error. But this was just an artifact of the missing pages my test. It doesn't explain the original problem. > ginInsertCleanup() is tolerant of all kinds of things. It's not just > the lack of page-level sanity checks. It's also the basic approach to > crash safety, which relies on the fact that GIN only does lossy index > scans. My guess is that there could be lots of problems without it > being obvious to users. Things really went downhill in > ginInsertCleanup() starting in commit e956808328. Yeah, I wouldn't be surprised if this turns out to be a bug in pending list cleanup. But I haven't been able to come up with a concrete sequence of steps that would cause it. - Heikki
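One of the sanity checks mentioned here -- no GIN page should ever carry a right-link to block 0, because block 0 is the metapage -- can be approximated from SQL against a copy of the index. A sketch; it assumes every block has a valid GIN special area, otherwise gin_page_opaque_info() errors out:
SELECT blkno, rightlink, flags
FROM generate_series(1,
       (pg_relation_size('entity_attribute_name_ids_gin')
        / current_setting('block_size')::int)::int - 1) AS blkno,
     LATERAL gin_page_opaque_info(
       get_raw_page('entity_attribute_name_ids_gin', blkno))
WHERE rightlink = 0;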
Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Peter Geoghegan
Date:
On Sat, Jul 17, 2021 at 2:27 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > Right. Looking at the code, I don't think that can happen. The code that > deals with the pending list uses lock coupling, so it should never > follow a pointer to a page that was concurrently deleted. Might > something funny happen if a pending list page is deleted and immediately > reused as another kind of page? I don't see a problem with that. I don't immediately see a problem either, but that hardly means much -- I can't honestly claim that I understand the code here. Why does the code in ginInsertCleanup() use lock coupling? It's probably true that if you don't couple pending list pages then the tests would break in an obvious way, but that tells us nothing about the design itself. It was probably self-evident what was going on in earlier versions of the code, back when it could only run in VACUUM anyway, when it didn't eagerly recycle pending list pages, etc. But it's far from obvious now that we have all that stuff too. Let's say processPendingPage() finds a page that is fundamentally of the wrong kind (i.e. not a pending page). Why shouldn't the result be something scary (some kind of data corruption), that doesn't come with an error or hard crash? I can see no reason. Small differences in the format of the tuples seem unlikely to be caught anywhere, at least offhand. Another possible consequence of the lack of sanity checking is that whole *groups* of sibling pages (say from a posting tree or the entry tree) that look like they're a group from the pending list (but are actually from some tree) could get corrupted, one by one. This could happen as a secondary effect of the initial corruption, which might have only affected one page at first. Why shouldn't ginInsertCleanup() do that? Almost all GIN pages use "GinPageGetOpaque(page)->rightlink" in one way or another. If this is what happened then there really is no telling what you'll end up with. You may have a very hard time zeroing in on the truly relevant corruption. > And a > gin btree page won't be reused prematurely as a pending list page, > because ginNewBuffer() checks GinPageIsRecyclable(), even if the page is > being reused as a pending list page. It doesn't matter that ginNewBuffer() checks GinPageIsRecyclable(), since the page header's pd_prune_xid field will be 0 for a pending list page (same as other index AMs, though not the same as other GIN code). Note that GinPageIsRecyclable() returns "true" (safe to recycle) when it sees any "pd_prune_xid = 0" page, by design. Also, if ginInsertCleanup() ever tries to process/delete a posting tree or entry tree page, then it's still not going to set pd_prune_xid in the first place (because it believes that they're pending list pages). > While playing with the trace Pawel sent, I loaded some of the index > pages into a local instance so that I could step through the code with > gdb. I filled the rest of the pages with zeros. I managed to get the gin > btree descend code into an infinite loop with that. When it followed a > pointer to an entry leaf page, but that entry leaf page was substituted > with zeros, it interpreted the all-zeros page as a page with a > right-link to block 0. It then stepped right to read block 0 - which is > the metapage! - and incorrectly interpreted it as an internal page. > Amazingly, it managed to perform the binary search on it as if it > contained index tuples, and came up wth an another block number, which I > had also replaced with all-zeros. 
That doesn't actually surprise me. I have deliberately corrupted a lot of test data in all kinds of ways. I have some idea how broken things can be internally without breaking in a very obvious way. > Long story short, we should sprinkle a lot more sanity checks to the > functions in ginbtree.c. It should check that the page is of expected > kind; no stepping from internal entry page to posting tree page. And > seeing a pointer to block 0 at any step should be an error. Definitely. > Yeah, I wouldn't be surprised if this turns out to be a bug in pending > list cleanup. But I haven't been able to come up with a concrete > sequence of steps that would cause it. I also wouldn't be surprised if it was some unrelated issue, despite everything. As broken as I think the code in ginInsertCleanup() is, it has managed to be broken without anybody noticing for quite a while on a few occasions already. So unfortunately all of my observations about ginInsertCleanup() may in the end not help you to find the true underlying bug. -- Peter Geoghegan
Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Peter Geoghegan
Date:
On Sat, Jul 17, 2021 at 4:25 PM Peter Geoghegan <pg@bowt.ie> wrote: > > And a > > gin btree page won't be reused prematurely as a pending list page, > > because ginNewBuffer() checks GinPageIsRecyclable(), even if the page is > > being reused as a pending list page. > > It doesn't matter that ginNewBuffer() checks GinPageIsRecyclable(), > since the page header's pd_prune_xid field will be 0 for a pending > list page (same as other index AMs, though not the same as other GIN > code). Note that GinPageIsRecyclable() returns "true" (safe to > recycle) when it sees any "pd_prune_xid = 0" page, by design. Also, if > ginInsertCleanup() ever tries to process/delete a posting tree or > entry tree page, then it's still not going to set pd_prune_xid in the > first place (because it believes that they're pending list pages). To be clear, I'm not worried about premature reuse of some other page as a pending list page -- that should be fine. I'm worried about unsafe early reuse of pending list pages, where the new page image is related to either a posting tree or the entry tree. I think that it's possible that the whole locking protocol can't handle premature recycling, because it wasn't really built with that in mind. shiftList() will call RecordFreeIndexPage() for pages it deletes as of commit e956808328 from 2015. I find the code hard to follow, but I suspect that some place cannot handle it if a marked-GIN_DELETED-by-shiftList() page gets concurrently recycled. Distant code may need to still have the same right link (and even the same page contents?) to handle everything. For example, distant code like scanGetCandidate() from ginget.c participates in buffer lock coupling of pending list pages, and so has interactions with ginInsertCleanup() and its own lock coupling. But scanGetCandidate() hasn't changed since 2009, and might well have been missed by the numerous bug fixes that changed scanGetCandidate() code following commit e956808328 from 2015. Nearby code in scanPendingInsert() also hasn't been changed since 2009. But ginInsertCleanup() *has* changed since that time, by quite a bit. Starting with making shiftList() call RecordFreeIndexPage() for pages it deletes. But other stuff too. I might be mistaken in certain details, but the fact that it's so hard to keep all the details straight concerns me. -- Peter Geoghegan
Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Pawel Kudzia
Date:
On Sat, Jul 17, 2021 at 10:51 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > On 16/07/2021 10:06, Pawel Kudzia wrote: > > > > > > On Thu, Jul 15, 2021 at 8:49 PM Heikki Linnakangas <hlinnaka@iki.fi > > <mailto:hlinnaka@iki.fi>> wrote: > > > > thank you for the hint. i've messed up patching. now i can get one step > > further: > > > > data=# CREATE EXTENSION amcheck; > > CREATE EXTENSION > > data=# gin_index_parent_check('entity_attribute_name_ids_gin'); > > ERROR: syntax error at or near "gin_index_parent_check" > > LINE 1: gin_index_parent_check('entity_attribute_name_ids_gin'); > > ^ > > Ah, sorry, you need to call it with SELECT, like: > > SELECT gin_index_parent_check('entity_attribute_name_ids_gin'); > That worked. It gave me gin_index_parent_check.txt, attached. > > i've also applied trace-gin-readbuffer-2.patch > > Darn, I missed one function that's used to read a page when descending > the GIN tree. That explains the seemingly nonsensical accesses in the > log - the trace left out some crucial accesses. > > Attached is a new version of that debugging patch. Please repeat the > same steps as before with this: > > 1. Apply the patch (remove previous patch first if necessary) > > 2. Compile Postgres, "make install" > > 3. Clear the log, start postgres > > 4. Run the query again: > > SELECT ctid, entity_id FROM entity WHERE > ( attribute_name_ids && '{1737}' ) AND NOT ( (attribute_name_ids||0) > && '{1737}') LIMIT 10; > for clarity - i'm also running "set enable_seqscan=off" before that select to ensure that GIN is used. SELECT gave me ctid | entity_id --------------+----------- (4002784,1) | 38048120 (4002869,14) | 95333744 (2 rows) > 5. Stop the server. > > 6. Extract the content of the accessed index blocks: > > cat <path to postgres log> | perl -ne '/.*read gin blk (\d+)/ && print > "$1\n" ' | sort -n |uniq > /tmp/blocknums > > psql data -c "create temp table blocknums(blknum int); copy blocknums > from '/tmp/blocknums'; copy (select blknum, > get_raw_page('entity_attribute_name_ids_gin', blknum) from blocknums) to > '/tmp/block-contents';" > > 7. Send over /tmp/blocknums, /tmp/block-contents and the postgres log again. > I'm sending those over directly to your mail. > Thank you for your patience! > Thanks a lot for the investigation! -- regards, Pawel Kudzia
Attachment
Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Heikki Linnakangas
Date:
On 18/07/2021 11:27, Pawel Kudzia wrote: > I'm sending those over directly to your mail. Thanks! I looked through those logs, and I'm now not seeing anything wrong with the pages involved. The query first scans the pending list, and doesn't find any matches there. It then descends the entry tree, finds pointer to a posting tree for id 1373, then scans the posting tree. For some reason, the TIDs of those offending rows, (4002784,1) and (4002869,14) are in the posting tree for key '1373', even though the heap tuples don't contain that key. Both of those TIDs are in the same TID list, on page 255179: > postgres=# select * from (select blknum, ((gin_leafpage_items(content)).*) from ginblocks where blknum=255179) x wheretids::text like '%(4002869,14)%'; > blknum | first_tid | nbytes | > tids > > > --------+--------------+--------+------------------------------------------------------------------------------------------------------------------------------------------------------------- > ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > ------------------------------------------------------------------------------------------- > 255179 | (4002720,16) | 87 | {"(4002720,16)","(4002721,3)","(4002723,5)","(4002739,11)","(4002743,14)","(4002744,2)","(4002784,1)","(4002784,10)","(4002790,46)","(4002790,47)","(4002792 > ,22)","(4002792,23)","(4002792,24)","(4002796,22)","(4002798,2)","(4002800,7)","(4002802,27)","(4002802,28)","(4002803,22)","(4002805,28)","(4002816,13)","(4002817,19)","(4002820,2)","(40028 > 25,17)","(4002854,7)","(4002855,13)","(4002855,14)","(4002869,14)","(4006217,3)","(4006217,11)","(4006307,11)","(4006330,4)","(4006330,8)","(4006331,6)","(4006332,9)","(4006332,10)","(400633 > 3,9)","(4006335,2)","(4006338,1)","(4006342,7)","(4006343,5)","(4006407,2)","(4008912,2)"} > (1 row) I've used pg_filedump and pageinspect on all of the pages on the path to this entry, but I don't see anything wrong with the index. So how did it end up like that? I'm afraid I have no clue :-(. Peter's hunch that it's some bad interaction between deleted pages and the pending list sounds plausible, but looking at the code, I don't see any concrete bugs like that. We have one more lead that we haven't fully explored: amcheck reported this error: ERROR: index "entity_attribute_name_ids_gin": tid exceeds parent's high key in postingTree leaf on block 321746 Give that the GIN amheck patch is still very experimental, that could well be a bug in amcheck instead of real corruption. Also, that error was *not* in the posting tree for key '1373'. I hacked amcheck on my laptop to run only on the posting tree or '1373', and it didn't find any errors there. But let's check it out. I modified 'amcheck' to print more information. I also changed it so that it doesn't stop at the first error, but keeps going, and reports all issues as warnings. Pawel: Could you try the attached version of amcheck? Please apply the attached patch (you'll have to un-apply the previous version or start from a clean source tree), run "SELECT gin_index_parent_check('entity_attribute_name_ids_gin');" again, and send over the result? 
It should print a bunch of INFOs and WARNINGs to the client's stderr, so a good way to capture that is something like this: psql -c "select gin_index_parent_check('entity_attribute_name_ids_gin');" 2> amcheck-log Let's see if that reveals something interesting. If not, what can we do next? 1. Let's fix the issues with the triconsistent functions that we discussed earlier. They don't explain what we're seeing here, but should be cleaned up nevertheless. 2. Let's add more sanity checks wherever we read a GIN page, to check that it's of the expected kind. Again we have no theory on what sequence of events could cause this, but having more sanity checks seems like a good idea in any case. 3. Run the test case again and wait for the issue to re-appear. This time, let's try to capture the WAL, so that we can trace back the modifications to the pages that lead to the situation. Any other ideas? - Heikki
Attachment
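A useful companion check that involves no GIN code at all: fetch the two offending heap tuples directly by the ctid values reported earlier and look at the arrays actually stored there. It only confirms what the recheck already implies -- the key in question is not in those arrays -- but it rules out the heap side entirely; a sketch:
SELECT ctid, entity_id, attribute_name_ids
FROM entity
WHERE ctid IN ('(4002784,1)', '(4002869,14)');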
Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Pawel Kudzia
Date:
On Tue, Jul 20, 2021 at 1:55 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > [..] > > > Pawel: Could you try the attached version of amcheck? Please apply the > attached patch (you'll have to un-apply the previous version or start > from a clean source tree), run "SELECT > gin_index_parent_check('entity_attribute_name_ids_gin');" again, and > send over the result? It should print a bunch of INFOs and WARNINGs to > the client's stderr, so a good way to capture that is something like this: > > psql -c "select > gin_index_parent_check('entity_attribute_name_ids_gin');" 2> amcheck-log > i'm attaching log of postgresql generated while running the command above and amcheck-log. thanks a lot! -- regards, Pawel Kudzia
Attachment
Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Heikki Linnakangas
Date:
On 20/07/2021 15:22, Pawel Kudzia wrote: > On Tue, Jul 20, 2021 at 1:55 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: >> Pawel: Could you try the attached version of amcheck? Please apply the >> attached patch (you'll have to un-apply the previous version or start >> from a clean source tree), run "SELECT >> gin_index_parent_check('entity_attribute_name_ids_gin');" again, and >> send over the result? It should print a bunch of INFOs and WARNINGs to >> the client's stderr, so a good way to capture that is something like this: >> >> psql -c "select >> gin_index_parent_check('entity_attribute_name_ids_gin');" 2> amcheck-log >> > > i'm attaching log of postgresql generated while running the command > above and amcheck-log. > > thanks a lot! Great! So it's reporting two kinds of issues. First this: WARNING: index "entity_attribute_name_ids_gin": tid exceeds parent's high key in postingTree leaf on block 343573 I looked at the amcheck code again, and that is in fact bogus; there is nothing wrong with the high keys, but there is a memory management bug in the amcheck patch that's causing it. The other error it complains about looks like this: ... INFO: key (1743225, 6) -> 363962 INFO: key (1746295, 15) -> 341281 INFO: key (1747712, 12) -> 343607 INFO: key (1749215, 4) -> 376056 INFO: key (262563, 1) -> 11547 INFO: key (3906513777, 43269) -> 31653888 INFO: key (536354808, 8196) -> 4432 ERROR: index "entity_attribute_name_ids_gin" has wrong tuple order in posting tree, block 371806, offset 818 Looking at the amcheck code for that check, that's a false alarm too. amcheck is calculating the size of this array incorrectly, thinking that there are more entries on the page than there really are. The entries above up to (1749215, 4) are valid, but the rest are not real entries, just amcheck's misinterpretation. Unfortunately 'amcheck' stopped at that error again. Sigh. I fixed those issues in amcheck again. And also added some more checks on the high keys. At least we're making the amcheck patch better, even if we're not any wiser about the original issue... Pawel: Can you test once again with the attached amcheck version? - Heikki
Attachment
Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Pawel Kudzia
Date:
On Tue, Jul 20, 2021 at 5:34 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > [..] > > Looking at the amcheck code for that check, that's a false alarm too. > amcheck is calculating the size of this array incorrectly, thinking that > there are more entries on the page than there really are. The entries > above up to (1749215, 4) are valid, but the rest are not real entries, > just amcheck's misinterpretation. Unfortunately 'amcheck' stopped at > that error again. > > Sigh. I fixed those issues in amcheck again. And also added some more > checks on the high keys. At least we're making the amcheck patch better, > even if we're not any wiser about the original issue... > > Pawel: Can you test once again with the attached amcheck version? patch v4-0001-Amcheck-for-GIN-13stable.patch applied to postgresql-13_13.3.orig.tar.bz2 since the compressed logs are quite large - i've sent them directly to Heikki. executing "select gin_index_parent_check('entity_attribute_name_ids_gin');" took considerably longer this time and it ended with: [524767.920498] postgres[29716]: segfault at fffffffffffffffe ip 00007f214c20a2d8 sp 00007ffe2a5fd040 error 5 in amcheck.so[7f214c209000+6000] [524767.920521] Code: 00 00 41 0f b7 4e 04 41 8b 56 0c 41 8b 76 10 49 89 c1 48 8d 04 40 c1 e1 10 45 0f b7 46 08 48 8d 7c 43 fa 41 0f b7 46 06 09 c1 <0f> b7 47 04 50 0f b7 07 0f b7 7f 02 c1 e0 10 09 f8 50 0f b7 43 04 > > - Heikki -- regards, Pawel Kudzia
Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Heikki Linnakangas
Date:
On 20/07/2021 18:58, Pawel Kudzia wrote: > On Tue, Jul 20, 2021 at 5:34 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: >> Pawel: Can you test once again with the attached amcheck version? > > patch v4-0001-Amcheck-for-GIN-13stable.patch applied to > postgresql-13_13.3.orig.tar.bz2 > attached - log and result of the check. > > executing "select > gin_index_parent_check('entity_attribute_name_ids_gin');" took > considerably longer this time and it ended with : > > [524767.920498] postgres[29716]: segfault at fffffffffffffffe ip > 00007f214c20a2d8 sp 00007ffe2a5fd040 error 5 in > amcheck.so[7f214c209000+6000] > [524767.920521] Code: 00 00 41 0f b7 4e 04 41 8b 56 0c 41 8b 76 10 49 > 89 c1 48 8d 04 40 c1 e1 10 45 0f b7 46 08 48 8d 7c 43 fa 41 0f b7 46 > 06 09 c1 <0f> b7 47 04 50 0f b7 07 0f b7 7f 02 c1 e0 10 09 f8 50 0f b7 > 43 04 Hmph, another set of bugs in the amcheck patch. It didn't handle empty entry tree items, nor empty posting tree pages. And there was another bug, which caused it to scan entry tree pages multiple times; that's why the same WARNING was printed multiple times. Fixed those bugs, new patch version attached. Pawel, can you test this again, please? At this point, I'm pretty sure this isn't going to reveal any more information about the original problem, but at least we're ironing out bugs from the 'amcheck' patch.. I'm grasping at straws here, but here's one more thing we could try: the query returned these incorrect tuples: ctid | entity_id --------------+----------- (4002784,1) | 38048120 (4002869,14) | 95333744 (2 rows) We know those entries are in the GIN index with key '1373', when they shouldn't be. But I wonder if the correct keys for those tuples are present? Pawel, can you try this, please: -- persuade planner to use a bitmap index scan set enable_seqscan=off; set enable_tidscan=off; -- avoid lossy bitmaps set work_mem='1GB'; -- Explain analyze first to check it's using a bitmap index scan and -- no lossy pages. explain analyze SELECT ctid, entity_id FROM entity WHERE attribute_name_ids && '{38048120}' and ctid='(4002784,1)'; SELECT ctid, entity_id FROM entity WHERE attribute_name_ids && '{38048120}' and ctid='(4002784,1)'; explain analyze SELECT ctid, entity_id FROM entity WHERE attribute_name_ids && '{95333744}' and ctid='(4002869,14)'; SELECT ctid, entity_id FROM entity WHERE attribute_name_ids && '{95333744}' and ctid='(4002869,14)'; - Heikki
Attachment
Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Pawel Kudzia
Date:
On Thu, Jul 22, 2021 at 9:25 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > On 20/07/2021 18:58, Pawel Kudzia wrote: > > On Tue, Jul 20, 2021 at 5:34 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > >> Pawel: Can you test once again with the attached amcheck version? > > > > patch v4-0001-Amcheck-for-GIN-13stable.patch applied to > > postgresql-13_13.3.orig.tar.bz2 > > attached - log and result of the check. > > > > executing "select > > gin_index_parent_check('entity_attribute_name_ids_gin');" took > > considerably longer this time and it ended with : > > > > [524767.920498] postgres[29716]: segfault at fffffffffffffffe ip > > 00007f214c20a2d8 sp 00007ffe2a5fd040 error 5 in > > amcheck.so[7f214c209000+6000] > > [524767.920521] Code: 00 00 41 0f b7 4e 04 41 8b 56 0c 41 8b 76 10 49 > > 89 c1 48 8d 04 40 c1 e1 10 45 0f b7 46 08 48 8d 7c 43 fa 41 0f b7 46 > > 06 09 c1 <0f> b7 47 04 50 0f b7 07 0f b7 7f 02 c1 e0 10 09 f8 50 0f b7 > > 43 04 > > Hmph, another set of bugs in the amcheck patch. It didn't handle empty > entry tree items, nor empty posting tree pages. And there was another > bug, which caused it to scan entry tree pages multiple times; that's why > the same WARNING was printed multiple times. > > Fixed those bugs, new patch version attached. Pawel, can you test this > again, please? At this point, I'm pretty sure this isn't going to reveal > any more information about the original problem, but at least we're > ironing out bugs from the 'amcheck' patch.. thank you for the next patch Heikki! no crash this time! i'm sending a message in a separate mail since i'm not sure if it'll pass through attachment size limit that's applied for the list. > I'm grasping at straws here, but here's one more thing we could try: the > query returned these incorrect tuples: > > ctid | entity_id > --------------+----------- > (4002784,1) | 38048120 > (4002869,14) | 95333744 > (2 rows) > > We know those entries are in the GIN index with key '1373', when they > shouldn't be. But I wonder if the correct keys for those tuples are > present? Pawel, can you try this, please: > > -- persuade planner to use a bitmap index scan > set enable_seqscan=off; > set enable_tidscan=off; > -- avoid lossy bitmaps > set work_mem='1GB'; > > -- Explain analyze first to check it's using a bitmap index scan and > -- no lossy pages. 
> explain analyze > SELECT ctid, entity_id FROM entity WHERE attribute_name_ids && > '{38048120}' and ctid='(4002784,1)'; > SELECT ctid, entity_id FROM entity WHERE attribute_name_ids && > '{38048120}' and ctid='(4002784,1)'; > > explain analyze > SELECT ctid, entity_id FROM entity WHERE attribute_name_ids && > '{95333744}' and ctid='(4002869,14)'; > SELECT ctid, entity_id FROM entity WHERE attribute_name_ids && > '{95333744}' and ctid='(4002869,14)'; data=# explain analyze SELECT ctid, entity_id FROM entity WHERE attribute_name_ids && '{38048120}' and ctid='(4002784,1)'; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on entity (cost=72.70..3366.69 rows=1 width=14) (actual time=42.995..42.998 rows=0 loops=1) Recheck Cond: (attribute_name_ids && '{38048120}'::integer[]) Filter: (ctid = '(4002784,1)'::tid) -> Bitmap Index Scan on entity_attribute_name_ids_gin (cost=0.00..72.70 rows=33055 width=0) (actual time=42.989..42.990 rows=0 loops=1) Index Cond: (attribute_name_ids && '{38048120}'::integer[]) Planning Time: 18.912 ms Execution Time: 43.515 ms (7 rows) data=# SELECT ctid, entity_id FROM entity WHERE attribute_name_ids && '{38048120}' and ctid='(4002784,1)'; ctid | entity_id ------+----------- (0 rows) data=# explain analyze SELECT ctid, entity_id FROM entity WHERE attribute_name_ids && '{95333744}' and ctid='(4002869,14)'; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on entity (cost=72.70..3366.69 rows=1 width=14) (actual time=2.552..2.554 rows=0 loops=1) Recheck Cond: (attribute_name_ids && '{95333744}'::integer[]) Filter: (ctid = '(4002869,14)'::tid) -> Bitmap Index Scan on entity_attribute_name_ids_gin (cost=0.00..72.70 rows=33055 width=0) (actual time=2.543..2.544 rows=0 loops=1) Index Cond: (attribute_name_ids && '{95333744}'::integer[]) Planning Time: 0.193 ms Execution Time: 2.594 ms (7 rows) data=# SELECT ctid, entity_id FROM entity WHERE attribute_name_ids && '{95333744}' and ctid='(4002869,14)'; ctid | entity_id ------+----------- (0 rows) > > - Heikki -- regards, Pawel Kudzia
Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Pawel Kudzia
Date:
On Thu, Jul 22, 2021 at 11:07 AM Pawel Kudzia <kudzia@gmail.com> wrote: > > On Thu, Jul 22, 2021 at 9:25 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > > > On 20/07/2021 18:58, Pawel Kudzia wrote: > > > On Tue, Jul 20, 2021 at 5:34 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > >> Pawel: Can you test once again with the attached amcheck version? > > > > > > patch v4-0001-Amcheck-for-GIN-13stable.patch applied to > > > postgresql-13_13.3.orig.tar.bz2 > > > attached - log and result of the check. > > > > > > executing "select > > > gin_index_parent_check('entity_attribute_name_ids_gin');" took > > > considerably longer this time and it ended with : > > > > > > [524767.920498] postgres[29716]: segfault at fffffffffffffffe ip > > > 00007f214c20a2d8 sp 00007ffe2a5fd040 error 5 in > > > amcheck.so[7f214c209000+6000] > > > [524767.920521] Code: 00 00 41 0f b7 4e 04 41 8b 56 0c 41 8b 76 10 49 > > > 89 c1 48 8d 04 40 c1 e1 10 45 0f b7 46 08 48 8d 7c 43 fa 41 0f b7 46 > > > 06 09 c1 <0f> b7 47 04 50 0f b7 07 0f b7 7f 02 c1 e0 10 09 f8 50 0f b7 > > > 43 04 > > > > Hmph, another set of bugs in the amcheck patch. It didn't handle empty > > entry tree items, nor empty posting tree pages. And there was another > > bug, which caused it to scan entry tree pages multiple times; that's why > > the same WARNING was printed multiple times. > > > > Fixed those bugs, new patch version attached. Pawel, can you test this > > again, please? At this point, I'm pretty sure this isn't going to reveal > > any more information about the original problem, but at least we're > > ironing out bugs from the 'amcheck' patch.. > > thank you for the next patch Heikki! > no crash this time! i'm sending a message in a separate mail since i'm > not sure if it'll pass through attachment size limit that's applied > for the list. there goes the attachment. > > > I'm grasping at straws here, but here's one more thing we could try: the > > query returned these incorrect tuples: > > > > ctid | entity_id > > --------------+----------- > > (4002784,1) | 38048120 > > (4002869,14) | 95333744 > > (2 rows) > > > > We know those entries are in the GIN index with key '1373', when they > > shouldn't be. But I wonder if the correct keys for those tuples are > > present? Pawel, can you try this, please: > > > > -- persuade planner to use a bitmap index scan > > set enable_seqscan=off; > > set enable_tidscan=off; > > -- avoid lossy bitmaps > > set work_mem='1GB'; > > > > -- Explain analyze first to check it's using a bitmap index scan and > > -- no lossy pages. 
> > explain analyze > > SELECT ctid, entity_id FROM entity WHERE attribute_name_ids && > > '{38048120}' and ctid='(4002784,1)'; > > SELECT ctid, entity_id FROM entity WHERE attribute_name_ids && > > '{38048120}' and ctid='(4002784,1)'; > > > > explain analyze > > SELECT ctid, entity_id FROM entity WHERE attribute_name_ids && > > '{95333744}' and ctid='(4002869,14)'; > > SELECT ctid, entity_id FROM entity WHERE attribute_name_ids && > > '{95333744}' and ctid='(4002869,14)'; > > > data=# explain analyze SELECT ctid, entity_id FROM entity WHERE > attribute_name_ids && '{38048120}' and ctid='(4002784,1)'; > QUERY PLAN > --------------------------------------------------------------------------------------------------------------------------------------------- > Bitmap Heap Scan on entity (cost=72.70..3366.69 rows=1 width=14) > (actual time=42.995..42.998 rows=0 loops=1) > Recheck Cond: (attribute_name_ids && '{38048120}'::integer[]) > Filter: (ctid = '(4002784,1)'::tid) > -> Bitmap Index Scan on entity_attribute_name_ids_gin > (cost=0.00..72.70 rows=33055 width=0) (actual time=42.989..42.990 > rows=0 loops=1) > Index Cond: (attribute_name_ids && '{38048120}'::integer[]) > Planning Time: 18.912 ms > Execution Time: 43.515 ms > (7 rows) > > data=# SELECT ctid, entity_id FROM entity WHERE attribute_name_ids && > '{38048120}' and ctid='(4002784,1)'; > ctid | entity_id > ------+----------- > (0 rows) > > data=# explain analyze SELECT ctid, entity_id FROM entity WHERE > attribute_name_ids && '{95333744}' and ctid='(4002869,14)'; > QUERY PLAN > ------------------------------------------------------------------------------------------------------------------------------------------- > Bitmap Heap Scan on entity (cost=72.70..3366.69 rows=1 width=14) > (actual time=2.552..2.554 rows=0 loops=1) > Recheck Cond: (attribute_name_ids && '{95333744}'::integer[]) > Filter: (ctid = '(4002869,14)'::tid) > -> Bitmap Index Scan on entity_attribute_name_ids_gin > (cost=0.00..72.70 rows=33055 width=0) (actual time=2.543..2.544 rows=0 > loops=1) > Index Cond: (attribute_name_ids && '{95333744}'::integer[]) > Planning Time: 0.193 ms > Execution Time: 2.594 ms > (7 rows) > > data=# SELECT ctid, entity_id FROM entity WHERE attribute_name_ids && > '{95333744}' and ctid='(4002869,14)'; > ctid | entity_id > ------+----------- > (0 rows) > > > > > > > > - Heikki > > > > -- > regards, > Pawel Kudzia -- regards, Pawel Kudzia
Attachment
Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Heikki Linnakangas
Date:
On 22/07/2021 12:07, Pawel Kudzia wrote: > On Thu, Jul 22, 2021 at 9:25 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: >> >> Fixed those bugs, new patch version attached. Pawel, can you test this >> again, please? At this point, I'm pretty sure this isn't going to reveal >> any more information about the original problem, but at least we're >> ironing out bugs from the 'amcheck' patch.. > > thank you for the next patch Heikki! > no crash this time! i'm sending a message in a separate mail since i'm > not sure if it'll pass through attachment size limit that's applied > for the list. Thanks! So looking at the log, amcheck is not reporting any more problems. >> I'm grasping at straws here, but here's one more thing we could try: the >> query returned these incorrect tuples: >> >> ctid | entity_id >> --------------+----------- >> (4002784,1) | 38048120 >> (4002869,14) | 95333744 >> (2 rows) >> >> We know those entries are in the GIN index with key '1373', when they >> shouldn't be. But I wonder if the correct keys for those tuples are >> present? Pawel, can you try this, please: > > [queries with no rows returned] Ok, so the index is missing entries for the correct key. Looks like the index entries were inserted into the wrong subtree, under the wrong key. But *how* did that happen? I'm out of ideas, I'm afraid :-(. - Heikki
Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Pawel Kudzia
Date:
On Fri, Jul 23, 2021 at 3:46 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > On 22/07/2021 12:07, Pawel Kudzia wrote: > > On Thu, Jul 22, 2021 at 9:25 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > >> > >> Fixed those bugs, new patch version attached. Pawel, can you test this > >> again, please? At this point, I'm pretty sure this isn't going to reveal > >> any more information about the original problem, but at least we're > >> ironing out bugs from the 'amcheck' patch.. > > > > thank you for the next patch Heikki! > > no crash this time! i'm sending a message in a separate mail since i'm > > not sure if it'll pass through attachment size limit that's applied > > for the list. > > Thanks! So looking at the log, amcheck is not reporting any more problems. > > >> I'm grasping at straws here, but here's one more thing we could try: the > >> query returned these incorrect tuples: > >> > >> ctid | entity_id > >> --------------+----------- > >> (4002784,1) | 38048120 > >> (4002869,14) | 95333744 > >> (2 rows) > >> > >> We know those entries are in the GIN index with key '1373', when they > >> shouldn't be. But I wonder if the correct keys for those tuples are > >> present? Pawel, can you try this, please: > > > > [queries with no rows returned] > > Ok, so the index is missing entries for the correct key. Looks like the > index entries were inserted into the wrong subtree, under the wrong key. But > *how* did that happen? I'm out of ideas, I'm afraid :-(. Thanks a lot for your patience and the multiple patches that you've provided. Please pardon my ignorance - I don't have a low-level understanding of how the query is being executed - but are you sure that the index is missing entries and not the other way around - that it has too many entries? To recap - a SELECT answered from the GIN index reports rows that actually do not match the criteria provided in WHERE. Just lowering work_mem makes the problem go away, with GIN still being used. Greetings! -- regards, Pawel Kudzia
Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Heikki Linnakangas
Date:
On 23 July 2021 18:04:50 EEST, Pawel Kudzia <kudzia@gmail.com> wrote: >On Fri, Jul 23, 2021 at 3:46 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: >> >> Ok, so the index is missing entries for the correct key. Looks like the >> index entries were inserted into the wrong subtree, under the wrong key. But >> *how* did that happen? I'm out of ideas, I'm afraid :-(. Can you tell more about the workload? What INSERT/UPDATE/DELETE commands do you run? How many concurrent clients? Do you run vacuum manually, or how often does autovacuum kick in? How long did it take for the problem to appear? What queries did you run to find the corruption, and how often? I know it's a big ask, but would it be possible to simplify the test case further, to make it reproduce faster? Ideally as a self-contained test script with any sensitive data removed. - Heikki
Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Heikki Linnakangas
Date:
On 23/07/2021 18:04, Pawel Kudzia wrote: > Thanks a lot for your patience and multiple patches that you've > provided. Please pardon my ignorance - I don't have low-level > understanding of how the query is being executed - but are you sure > that index is missing entries and not the other way around - that it > has too many entries? To be precise, the index has an extra entry for row (4002784,1) with key 1373, and it's missing the entry with key 38048120. And for row (4002869,14), it has an extra row for key 1373, and it's missing the entry for key 95333744. > To recap - SELECT, answered based on the GIN, reports rows that > actually do not match the criteria provided in WHERE. Just lowering > work_mem makes the problem go away, whith GIN still being used. The reason that lowering work_mem hides the problem is that GIN collects all the matches in a so called TID bitmap, and if the bitmap grows too large compared to work_mem, it becomes lossy to save memory. When it's lossy, it only stores the heap block numbers of the matches. For a regular, non-lossy, match, the Bitmap Heap Scan just returns the row that the index says is a match. For the lossy matches, the Bitmap Heap Scan node needs to check every row on the page to see which ones actually match. This re-checking hides the problem that some of the matches that the index reported were not real. Note that you are also getting incorrect results with missing row for other queries. You can try it with e.g.: -- Using the index set enable_seqscan=off; set enable_bitmapscan=on; SELECT count(*) FROM entity WHERE attribute_name_ids && '{95333744}'; SELECT count(*) FROM entity WHERE attribute_name_ids && '{38048120}'; -- Without index set enable_seqscan=on; set enable_bitmapscan=off; SELECT count(*) FROM entity WHERE attribute_name_ids && '{95333744}'; SELECT count(*) FROM entity WHERE attribute_name_ids && '{38048120}'; I'll work on a patch to add more sanity checks to the GIN code when it traverses the tree, to catch the case that it accidentally steps on a wrong kind of a page (I'm pretty busy next week, so might not get to that until the week after though). I don't think that will help here, but who knows, and at least we can rule out some kinds of bugs. Alexander, can you finish the fixes to the ternary logic? That doesn't explain this corruption either, but we should fix it anyway. - Heikki
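To make the lossiness mechanism described above concrete, here is a minimal sketch, not from the original exchange, of how the effect of work_mem on the TID bitmap can be observed directly; the table and key value follow the report, the work_mem settings are only illustrative:
-- small work_mem: the bitmap goes lossy, the heap scan rechecks whole pages,
-- and bogus index entries are filtered out
set enable_seqscan = off;
set work_mem = '256kB';
explain (analyze, buffers) SELECT entity_id FROM entity WHERE attribute_name_ids && '{1373}';
-- expect "Heap Blocks: exact=... lossy=..." plus "Rows Removed by Index Recheck"
-- large work_mem: the bitmap stays exact, matches are returned without recheck,
-- and rows reported only by the corrupted index leak through
set work_mem = '1GB';
explain (analyze, buffers) SELECT entity_id FROM entity WHERE attribute_name_ids && '{1373}';
-- expect "Heap Blocks: exact=..." only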
Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Pawel Kudzia
Date:
On Fri, Jul 23, 2021 at 11:47 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > On 23 July 2021 18:04:50 EEST, Pawel Kudzia <kudzia@gmail.com> wrote: > >On Fri, Jul 23, 2021 at 3:46 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > >> > >> Ok, so the index is missing entries for the correct key. Looks like the > >> index entries were inserted into the wrong subtree, under the wrong key. But > >> *how* did that happen? I'm out of ideas, I'm afraid :-(. > > Can you tell more about the workload? What INSERT/UPDATE/DELETE commands do you run? How many concurrent clients? Do you run vacuum manually, or how often does autovacuum kick in? How long did it take for the problem to appear? What queries did you run to find the corruption, and how often? > It's a mix with a low percentage of simple INSERT INTO entity statements [ inserting a single row within a transaction ], DELETE FROM entity [ deleting a single row, selected by primary key ] and a high number of UPDATE entity SET .. WHERE primname=? statements. There are a few million such updates a day. A single transaction has just one such update; the update sets most of the table columns, although in fact most of their values remain unchanged. Concurrency level - less than 50 parallel connections. Vacuum is run manually extremely rarely [ maybe 2-3x a year ]. Based on recent logs - autovacuum happens every ~7 days. The problem was discovered by coincidence - one of our users noticed incorrect results, which prompted me to run a script that hammered the database every hour with a set of queries of this form: SELECT primname FROM entity WHERE ( attribute_name_ids && '{{$ani}}' ) AND NOT ( (attribute_name_ids||0) && '{{$ani}}') It would discover ~1-5 problematic entries each week, but there were periods of a few weeks where no new problems were found [ note that this does not mean there were no such problems - I was hammering only one of the GIN columns, the one with the smallest universe of values to iterate over ]. > I know it's a big ask, but would it be possible to simplify the test case further, to make it reproduce faster? Ideally as a self-contained test script with any sensitive data removed. > Sadly - I've put quite a lot of effort into reproducing the issue in a test environment and was never able to do it. The problem occurred only on the production server, after prolonged exposure to live update traffic. Note that in the meantime I've moved the database to another physical server to rule out hardware malfunction - that did not help, the problem continued to occur. I've also upgraded from 12.5 to 13.3. I no longer have a production setup that uses an index built in this way: CREATE INDEX entity_attribute_name_ids_gin ON public.entity USING gin (attribute_name_ids public.gin__int_ops); Two weeks ago I moved to CREATE INDEX entity_attribute_name_ids_gin ON public.entity USING gin (attribute_name_ids); and I have not seen the problem manifest again. Nevertheless - I have 2 file-level dumps of databases where we caught the inconsistency and I would like to help with diagnosing the problem. -- regards, Pawel Kudzia
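A rough sketch of the kind of hourly sweep described above follows; it is not Pawel's actual script, and it assumes a hypothetical lookup table attribute_name(attribute_name_id) from which the candidate ids can be enumerated:
DO $$
DECLARE
    ani integer;
    bad bigint;
BEGIN
    -- iterate over the known attribute name ids and count "impossible" rows
    FOR ani IN SELECT attribute_name_id FROM attribute_name LOOP
        SELECT count(*) INTO bad
        FROM entity
        WHERE (attribute_name_ids && ARRAY[ani])
          AND NOT ((attribute_name_ids || 0) && ARRAY[ani]);
        IF bad > 0 THEN
            RAISE WARNING 'attribute_name_id % is reported by the index for % rows that do not contain it', ani, bad;
        END IF;
    END LOOP;
END
$$;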
Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Pawel Kudzia
Date:
On Sun, Jul 25, 2021 at 9:08 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > On 23/07/2021 18:04, Pawel Kudzia wrote: > > Thanks a lot for your patience and multiple patches that you've > > provided. Please pardon my ignorance - I don't have low-level > > understanding of how the query is being executed - but are you sure > > that index is missing entries and not the other way around - that it > > has too many entries? > > To be precise, the index has an extra entry for row (4002784,1) with key > 1373, and it's missing the entry with key 38048120. And for row > (4002869,14), it has an extra row for key 1373, and it's missing the > entry for key 95333744. > > > To recap - SELECT, answered based on the GIN, reports rows that > > actually do not match the criteria provided in WHERE. Just lowering > > work_mem makes the problem go away, whith GIN still being used. > > The reason that lowering work_mem hides the problem is that GIN collects > all the matches in a so called TID bitmap, and if the bitmap grows too > large compared to work_mem, it becomes lossy to save memory. When it's > lossy, it only stores the heap block numbers of the matches. For a > regular, non-lossy, match, the Bitmap Heap Scan just returns the row > that the index says is a match. For the lossy matches, the Bitmap Heap > Scan node needs to check every row on the page to see which ones > actually match. This re-checking hides the problem that some of the > matches that the index reported were not real. Thank you for the explanation! > > Note that you are also getting incorrect results with missing row for > other queries. You can try it with e.g.: > > -- Using the index > set enable_seqscan=off; > set enable_bitmapscan=on; > SELECT count(*) FROM entity WHERE attribute_name_ids && '{95333744}'; > SELECT count(*) FROM entity WHERE attribute_name_ids && '{38048120}'; > > -- Without index > set enable_seqscan=on; > set enable_bitmapscan=off; > SELECT count(*) FROM entity WHERE attribute_name_ids && '{95333744}'; > SELECT count(*) FROM entity WHERE attribute_name_ids && '{38048120}'; > > Actually - both give identical results - count(*) = 0. I think you actually wanted me to run those: data=# set enable_seqscan=off; SET data=# set enable_bitmapscan=on; SET data=# SELECT count(*) FROM entity WHERE attribute_name_ids && '{1737}'; count ------- 79565 (1 row) data=# set enable_seqscan=on; SET data=# set enable_bitmapscan=off; SET data=# SELECT count(*) FROM entity WHERE attribute_name_ids && '{1737}'; count ------- 79560 (1 row) Results indeed differ. > I'll work on a patch to add more sanity checks to the GIN code when it > traverses the tree, to catch the case that it accidentally steps on a > wrong kind of a page (I'm pretty busy next week, so might not get to > that until the week after though). I don't think that will help here, > but who knows, and at least we can rule out some kinds of bugs. > Alexander, can you finish the fixes to the ternary logic? That doesn't > explain this corruption either, but we should fix it anyway. > Thanks a lot! I'm happy to test. Greetings! -- regards, Pawel Kudzia
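A possible next step, sketched here rather than taken from the thread, is to materialize both result sets for key 1737 and diff them, so the five-row discrepancy above can be tied to concrete entity_ids; the temp table names are arbitrary:
set enable_seqscan = off; set enable_bitmapscan = on;
CREATE TEMP TABLE via_index AS SELECT entity_id FROM entity WHERE attribute_name_ids && '{1737}';
set enable_seqscan = on; set enable_bitmapscan = off;
CREATE TEMP TABLE via_heap AS SELECT entity_id FROM entity WHERE attribute_name_ids && '{1737}';
-- rows the index reports but whose arrays do not contain the key (phantom matches)
SELECT entity_id FROM via_index EXCEPT SELECT entity_id FROM via_heap;
-- rows whose arrays contain the key but which the index fails to report (missing matches)
SELECT entity_id FROM via_heap EXCEPT SELECT entity_id FROM via_index;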
Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Peter Geoghegan
Date:
Just to make sure: are you using any kind of storage or other virtualization, or a SAN? And what is your approach to backups?
Peter Geoghegan
(Sent from my phone)
Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Pawel Kudzia
Date:
On Sun, Jul 25, 2021 at 10:27 PM Peter Geoghegan <pg@bowt.ie> wrote: > > Just to make sure: are you using any kind of storage or other virtualization, or a SAN? And what is your approach to backups? > > Peter Geoghegan > (Sent from my phone) It's locally attached storage, layered as follows:
* 2x NVMe drives installed in the server
* mdadm RAID1 on top of them
* LUKS full-disk encryption on top of that
* ext4 on top of that
Please note that exactly the same storage / partition also holds MySQL and Sphinxsearch databases - all of those run on the same server, under different users. All of them [ PG, MySQL, Sphinx ] are under heavy read traffic and under update traffic comparable to the one described above for PostgreSQL - there are no consistency issues in the Sphinx or MySQL databases. PostgreSQL backups are done with pg_dumpall every 24h. We also use streaming replication, so the servers where inconsistencies were found are also masters for 5 replicas. -- regards, Pawel Kudzia
Re: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Pawel Kudzia
Date:
On Thu, Jun 17, 2021 at 4:07 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Pawel Kudzia <kudzia@gmail.com> writes: > > with help from IRC we've found that decreasing work_mem from 1MB to 256kB > > or less makes the problem go away: > > Hmm. So that suggests that the index itself is *not* corrupt, > but the problem is associated with a bug in the indexscan > algorithms. > > Have you experimented with different index opclasses? Your > original report said you were using gin__int_ops, but that's > not the default, and we sort of deprecate intarray these days. Hello, Three weeks ago we've changed from CREATE INDEX entity_attribute_name_ids_gin ON public.entity USING gin (attribute_name_ids public.gin__int_ops); into CREATE INDEX entity_attribute_name_ids_gin ON public.entity USING gin (attribute_name_ids); Sadly - the problems are back. I have: * queries that return incorrect responses: SET enable_seqscan = OFF; SELECT entity_id FROM entity WHERE ( attribute_name_ids && '{1485}' ) AND NOT ( (attribute_name_ids||0) && '{1485}') LIMIT 10; SELECT entity_id FROM entity WHERE ( attribute_name_ids && '{144}' ) AND NOT ( (attribute_name_ids||0) && '{144}') LIMIT 10; SELECT entity_id FROM entity WHERE ( attribute_name_ids && '{2652}' ) AND NOT ( (attribute_name_ids||0) && '{2652}') LIMIT 10; all three return: entity_id | primname -----------+-------------------- 99311962 | msci_674591ltepsgt 99312880 | msci_674667mgu data=# explain analyze SELECT entity_id FROM entity WHERE ( attribute_name_ids && '{2652}' ) AND NOT ( (attribute_name_ids||0) && '{2652}') LIMIT 10; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=30.59..31.71 rows=10 width=8) (actual time=149.988..200.749 rows=2 loops=1) -> Bitmap Heap Scan on entity (cost=30.59..3798.81 rows=33644 width=8) (actual time=149.985..200.744 rows=2 loops=1) Recheck Cond: (attribute_name_ids && '{2652}'::integer[]) Rows Removed by Index Recheck: 79741 Filter: (NOT ((attribute_name_ids || 0) && '{2652}'::integer[])) Rows Removed by Filter: 16723 Heap Blocks: exact=8277 lossy=9989 -> Bitmap Index Scan on entity_attribute_name_ids_gin (cost=0.00..30.59 rows=33813 width=0) (actual time=9.712..9.713 rows=37367 loops=1) Index Cond: (attribute_name_ids && '{2652}'::integer[]) Planning Time: 0.472 ms Execution Time: 200.977 ms (11 rows) * first time when hourly consistency check reported the inconsistency [ 2021-08-01 19:04:15 for entity_id=99312880, 2021-08-01 23:04:15 for entity_id=99311962 ] * timestamps when affected row has been updated [ entity_id=99312880: 2021-08-01 16:51:59, 2021-08-01 19:04:06 [ most likely that update led to inconsistency ], 2021-08-01 23:04:14, 2021-08-02 01:08:18, 2021-08-02 05:12:15, entity_id=99311962 2021-08-01 16:51:59, 2021-08-01 19:04:05, 2021-08-01 23:04:13 [ likely culprit ], 2021-08-02 01:08:16 ] * wal files from the **master server** from the time when the inconsistency started to occur [ from 2021-08-01 14:50:37 till 2021-08-02 04:31:00 ; sadly i don't have enough to cover the whole period between snapshots mentioned below ] * file-level snapshot of the database taken from streaming **replication slave** before the problem started to occur [ 2021-08-01 12:00:01 ] and after it occured [ 2021-08-02 00:00:01 ] I'll be happy to run any diagnostics needed, provide access to particular wal file. Greetings! -- regards, Pawel Kudzia
Re: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Pawel Kudzia
Date:
On Mon, Aug 2, 2021 at 10:07 AM Pawel Kudzia <kudzia@gmail.com> wrote: > > On Thu, Jun 17, 2021 at 4:07 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > Pawel Kudzia <kudzia@gmail.com> writes: > > > with help from IRC we've found that decreasing work_mem from 1MB to 256kB > > > or less makes the problem go away: > > > > Hmm. So that suggests that the index itself is *not* corrupt, > > but the problem is associated with a bug in the indexscan > > algorithms. > > > > Have you experimented with different index opclasses? Your > > original report said you were using gin__int_ops, but that's > > not the default, and we sort of deprecate intarray these days. > > Hello, > > Three weeks ago we've changed from > > CREATE INDEX entity_attribute_name_ids_gin ON public.entity USING gin > (attribute_name_ids public.gin__int_ops); > > into > > CREATE INDEX entity_attribute_name_ids_gin ON public.entity USING gin > (attribute_name_ids); > > Sadly - the problems are back. > > I have: > * queries that return incorrect responses: > SET enable_seqscan = OFF; > SELECT entity_id FROM entity WHERE ( attribute_name_ids && '{1485}' ) > AND NOT ( (attribute_name_ids||0) && '{1485}') LIMIT 10; > SELECT entity_id FROM entity WHERE ( attribute_name_ids && '{144}' ) > AND NOT ( (attribute_name_ids||0) && '{144}') LIMIT 10; > SELECT entity_id FROM entity WHERE ( attribute_name_ids && '{2652}' ) > AND NOT ( (attribute_name_ids||0) && '{2652}') LIMIT 10; > > all three return: > entity_id | primname > -----------+-------------------- > 99311962 | msci_674591ltepsgt > 99312880 | msci_674667mgu > > data=# explain analyze SELECT entity_id FROM entity WHERE ( > attribute_name_ids && '{2652}' ) AND NOT ( (attribute_name_ids||0) && > '{2652}') LIMIT 10; > QUERY PLAN > ----------------------------------------------------------------------------------------------------------------------------------------------------- > Limit (cost=30.59..31.71 rows=10 width=8) (actual > time=149.988..200.749 rows=2 loops=1) > -> Bitmap Heap Scan on entity (cost=30.59..3798.81 rows=33644 > width=8) (actual time=149.985..200.744 rows=2 loops=1) > Recheck Cond: (attribute_name_ids && '{2652}'::integer[]) > Rows Removed by Index Recheck: 79741 > Filter: (NOT ((attribute_name_ids || 0) && '{2652}'::integer[])) > Rows Removed by Filter: 16723 > Heap Blocks: exact=8277 lossy=9989 > -> Bitmap Index Scan on entity_attribute_name_ids_gin > (cost=0.00..30.59 rows=33813 width=0) (actual time=9.712..9.713 > rows=37367 loops=1) > Index Cond: (attribute_name_ids && '{2652}'::integer[]) > Planning Time: 0.472 ms > Execution Time: 200.977 ms > (11 rows) > > > * first time when hourly consistency check reported the inconsistency > [ 2021-08-01 19:04:15 for entity_id=99312880, 2021-08-01 23:04:15 for > entity_id=99311962 ] > * timestamps when affected row has been updated [ entity_id=99312880: > 2021-08-01 16:51:59, 2021-08-01 19:04:06 [ most likely that update led > to inconsistency ], 2021-08-01 23:04:14, 2021-08-02 01:08:18, > 2021-08-02 05:12:15, entity_id=99311962 2021-08-01 16:51:59, > 2021-08-01 19:04:05, 2021-08-01 23:04:13 [ likely culprit ], > 2021-08-02 01:08:16 ] > * wal files from the **master server** from the time when the > inconsistency started to occur [ from 2021-08-01 14:50:37 till > 2021-08-02 04:31:00 ; sadly i don't have enough to cover the whole > period between snapshots mentioned below ] > * file-level snapshot of the database taken from streaming > **replication slave** before the problem started to occur [ 2021-08-01 > 12:00:01 ] and after 
it occured [ 2021-08-02 00:00:01 ] > > I'll be happy to run any diagnostics needed, provide access to > particular wal file. > Hello, We continue to run into this issue and we still do not have any reliable method to reproduce it. Is there any diagnostic information that we could provide to make it easier to resolve? Thank you! -- regards, Pawel Kudzia
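One kind of low-level detail that can be pulled from an affected index, sketched here under the assumption that the pageinspect extension can be installed on a snapshot copy, is the raw page content; block 0 is the GIN metapage, the other block number is a placeholder:
CREATE EXTENSION IF NOT EXISTS pageinspect;
-- GIN metapage: pending-list pointers and index statistics
SELECT * FROM gin_metapage_info(get_raw_page('entity_attribute_name_ids_gin', 0));
-- page flags, to see whether a given block is an entry page, a posting-tree page, a leaf, etc.
SELECT * FROM gin_page_opaque_info(get_raw_page('entity_attribute_name_ids_gin', 4242));
-- item pointers stored on a posting-tree leaf page (only meaningful for such pages)
SELECT * FROM gin_leafpage_items(get_raw_page('entity_attribute_name_ids_gin', 4242));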
Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Peter Geoghegan
Date:
On Sun, Jul 25, 2021 at 12:08 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > I'll work on a patch to add more sanity checks to the GIN code when it > traverses the tree, to catch the case that it accidentally steps on a > wrong kind of a page (I'm pretty busy next week, so might not get to > that until the week after though). I don't think that will help here, > but who knows, and at least we can rule out some kinds of bugs. I just noticed that among call sites to ginInsertCleanup() located in ginvacuum.c, only one caller specifies fill_fsm=true: The call made when an autoanalyze (i.e. av worker that just does an ANALYZE) is run. In other words, a call through the special analyze_only path [1]. Note, in particular, that a plain VACUUM, or a plain VACUUM ANALYZE will always specify fill_fsm=false, regardless of any of the details. This seems totally contradictory to me. (***Thinks some more***) Actually, that's not quite right -- it's not contradictory once you realize that fill_fsm=true is an extra special path, originally designed just for the analyze_only path. That's easy to see once you look at commit dc943ad952 from 2015. Its specific goal was to allow this special analyze_only path to recycle pages. We need fill_fsm=true because the analyze_only path won't scan the whole index again at the end of ginvacuumcleanup(). (Actually, why even scan it in the similar cleanup-only VACUUM case? Ugh, that's not worth getting into now.) In summary: Although I always suspected the fill_fsm=true ShiftList() path here, it's even easier to suspect it now. Because I see now that it hardly ever gets used inside autovacuum worker processes, though of course it is possible. It's probably a matter of luck as to whether you hit the analyze_only + fill_fsm=true ShiftList() path. That might well have contributed to our difficulty in recreating the bugs here. This is such a mess. I'm not sure that this insight will be news to you, Heikki. Perhaps I'm just venting about the confused state of the code here, following a recent unpleasant reminder (again, see [1]). [1] https://postgr.es/m/CAH2-WzkjrK556enVtFLmyXEdw91xGuwiyZVep2kp5yQT_-3JDg@mail.gmail.com -- Peter Geoghegan
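For readers following along: ginInsertCleanup() is the routine that moves entries from the GIN fast-update pending list into the main index structure. The same cleanup can also be requested from SQL, which makes it possible to exercise that path without waiting for vacuum or autoanalyze; a sketch, with the index name taken from the report and pgstattuple assumed to be available:
-- flush the pending list of one GIN index explicitly
SELECT gin_clean_pending_list('entity_attribute_name_ids_gin'::regclass);
-- or bypass the pending list entirely for this index
ALTER INDEX entity_attribute_name_ids_gin SET (fastupdate = off);
-- inspect the current pending-list size (requires the pgstattuple extension)
SELECT * FROM pgstatginindex('entity_attribute_name_ids_gin');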
Re: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
From
Pawel Kudzia
Date:
On Mon, Sep 6, 2021 at 10:18 AM Pawel Kudzia <kudzia@gmail.com> wrote: > > On Mon, Aug 2, 2021 at 10:07 AM Pawel Kudzia <kudzia@gmail.com> wrote: > > > > On Thu, Jun 17, 2021 at 4:07 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > > > Pawel Kudzia <kudzia@gmail.com> writes: > > > > with help from IRC we've found that decreasing work_mem from 1MB to 256kB > > > > or less makes the problem go away: > > > > > > Hmm. So that suggests that the index itself is *not* corrupt, > > > but the problem is associated with a bug in the indexscan > > > algorithms. > > > > > > Have you experimented with different index opclasses? Your > > > original report said you were using gin__int_ops, but that's > > > not the default, and we sort of deprecate intarray these days. > > > > Hello, > > > > Three weeks ago we've changed from > > > > CREATE INDEX entity_attribute_name_ids_gin ON public.entity USING gin > > (attribute_name_ids public.gin__int_ops); > > > > into > > > > CREATE INDEX entity_attribute_name_ids_gin ON public.entity USING gin > > (attribute_name_ids); > > > > Sadly - the problems are back. > > > > I have: > > * queries that return incorrect responses: > > SET enable_seqscan = OFF; > > SELECT entity_id FROM entity WHERE ( attribute_name_ids && '{1485}' ) > > AND NOT ( (attribute_name_ids||0) && '{1485}') LIMIT 10; > > SELECT entity_id FROM entity WHERE ( attribute_name_ids && '{144}' ) > > AND NOT ( (attribute_name_ids||0) && '{144}') LIMIT 10; > > SELECT entity_id FROM entity WHERE ( attribute_name_ids && '{2652}' ) > > AND NOT ( (attribute_name_ids||0) && '{2652}') LIMIT 10; > > > > all three return: > > entity_id | primname > > -----------+-------------------- > > 99311962 | msci_674591ltepsgt > > 99312880 | msci_674667mgu > > > > data=# explain analyze SELECT entity_id FROM entity WHERE ( > > attribute_name_ids && '{2652}' ) AND NOT ( (attribute_name_ids||0) && > > '{2652}') LIMIT 10; > > QUERY PLAN > > ----------------------------------------------------------------------------------------------------------------------------------------------------- > > Limit (cost=30.59..31.71 rows=10 width=8) (actual > > time=149.988..200.749 rows=2 loops=1) > > -> Bitmap Heap Scan on entity (cost=30.59..3798.81 rows=33644 > > width=8) (actual time=149.985..200.744 rows=2 loops=1) > > Recheck Cond: (attribute_name_ids && '{2652}'::integer[]) > > Rows Removed by Index Recheck: 79741 > > Filter: (NOT ((attribute_name_ids || 0) && '{2652}'::integer[])) > > Rows Removed by Filter: 16723 > > Heap Blocks: exact=8277 lossy=9989 > > -> Bitmap Index Scan on entity_attribute_name_ids_gin > > (cost=0.00..30.59 rows=33813 width=0) (actual time=9.712..9.713 > > rows=37367 loops=1) > > Index Cond: (attribute_name_ids && '{2652}'::integer[]) > > Planning Time: 0.472 ms > > Execution Time: 200.977 ms > > (11 rows) > > > > > > * first time when hourly consistency check reported the inconsistency > > [ 2021-08-01 19:04:15 for entity_id=99312880, 2021-08-01 23:04:15 for > > entity_id=99311962 ] > > * timestamps when affected row has been updated [ entity_id=99312880: > > 2021-08-01 16:51:59, 2021-08-01 19:04:06 [ most likely that update led > > to inconsistency ], 2021-08-01 23:04:14, 2021-08-02 01:08:18, > > 2021-08-02 05:12:15, entity_id=99311962 2021-08-01 16:51:59, > > 2021-08-01 19:04:05, 2021-08-01 23:04:13 [ likely culprit ], > > 2021-08-02 01:08:16 ] > > * wal files from the **master server** from the time when the > > inconsistency started to occur [ from 2021-08-01 14:50:37 till > > 2021-08-02 04:31:00 ; sadly i 
don't have enough to cover the whole > > period between snapshots mentioned below ] > > * file-level snapshot of the database taken from streaming > > **replication slave** before the problem started to occur [ 2021-08-01 > > 12:00:01 ] and after it occured [ 2021-08-02 00:00:01 ] > > > > I'll be happy to run any diagnostics needed, provide access to > > particular wal file. > > > > Hello, > > We continue to run into this issue and we still do not have any > reliable method to reproduce it. Is there any diagnostic information > that we could provide to make it easier to resolve? > > Thank you! > > Hello, I'd like to revive this thread. In the meanwhile we've upgraded to PostgreSQL 16.3-1.pgdg110+1, moved to new x64 hardware and upgraded to Debian Bookworm. And we still observe occasional GIN index corruption resulting in 'impossible' queries like below returning rows which clearly do not match the WHERE criteria: data=# SELECT entity_id FROM entity WHERE ( attribute_name_ids && '{49}' ) AND NOT ( (attribute_name_ids||0) && '{49}'); entity_id ----------- 88036166 (1 row) data=# explain analyze SELECT entity_id FROM entity WHERE ( attribute_name_ids && '{49}' ) AND NOT ( (attribute_name_ids||0) && '{49}'); QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on entity (cost=59.10..361.74 rows=3011 width=8) (actual time=41.614..46.918 rows=1 loops=1) Recheck Cond: (attribute_name_ids && '{49}'::integer[]) Filter: (NOT ((attribute_name_ids || 0) && '{49}'::integer[])) Rows Removed by Filter: 2798 Heap Blocks: exact=4715 -> Bitmap Index Scan on entity_attribute_name_ids_gin (cost=0.00..59.10 rows=3026 width=0) (actual time=6.563..6.564 rows=17842 loops=1) Index Cond: (attribute_name_ids && '{49}'::integer[]) Planning Time: 0.115 ms Execution Time: 47.032 ms (9 rows) The only step needed to get into this state is to leave our update processes running for few weeks. Our workaround, which has been working OK for us for the past 4 years, is to run REINDEX INDEX CONCURRENTLY entity_attribute_value_ids_gin every 24h. If there's any additional diagnostic data that we can provide to help in finding and fixing the root cause - please let me know. Thank you! -- regards, Pawel Kudzia
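The workaround described above amounts to a periodic detection-plus-rebuild cycle; a compact sketch of what such a scheduled job might run, with the index and table names taken from this report and the probe key purely illustrative:
-- detection: on a healthy index this query must return zero rows
SELECT entity_id FROM entity WHERE ( attribute_name_ids && '{49}' ) AND NOT ( (attribute_name_ids||0) && '{49}' );
-- remediation: rebuild the GIN indexes without blocking writers (PostgreSQL 12+)
REINDEX INDEX CONCURRENTLY entity_attribute_name_ids_gin;
REINDEX INDEX CONCURRENTLY entity_attribute_value_ids_gin;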