Re: snapshot too old issues, first around wraparound and then more. - Mailing list pgsql-hackers
From | Andres Freund
---|---
Subject | Re: snapshot too old issues, first around wraparound and then more.
Date |
Msg-id | 20200401235951.vo4ilpdcukdodju2@alap3.anarazel.de
In response to | Re: snapshot too old issues, first around wraparound and then more. (Peter Geoghegan <pg@bowt.ie>)
Responses | Re: snapshot too old issues, first around wraparound and then more.
List | pgsql-hackers
Hi,

On 2020-04-01 15:30:39 -0700, Peter Geoghegan wrote:
> On Wed, Apr 1, 2020 at 3:00 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > I like that idea. I think that I've spotted what may be an independent
> > bug, but I have to wait around for a minute or two to reproduce it
> > each time. Makes it hard to get to a minimal test case.
>
> I now have simple steps to reproduce a bug when I start Postgres
> master with "--old_snapshot_threshold=1" (1 minute).

Thanks, that's super helpful.

> This example shows wrong answers to queries in session 2:
>
> Session 1:
>
> pg@regression:5432 [1444078]=# create table snapshot_bug (col int4);
> CREATE TABLE
> pg@regression:5432 [1444078]=# create index on snapshot_bug (col );
> CREATE INDEX
> pg@regression:5432 [1444078]=# insert into snapshot_bug select i from
> generate_series(1, 500) i;
> INSERT 0 500
>
> Session 2 starts, and views the data in a serializable transaction:
>
> pg@regression:5432 [1444124]=# begin isolation level serializable ;
> BEGIN
> pg@regression:5432 [1444124]=*# select col from snapshot_bug where col
> >= 0 order by col limit 14;
> ┌─────┐
> │ col │
> ├─────┤
> │   1 │
> │   2 │
> │   3 │
> │   4 │
> │   5 │
> │   6 │
> │   7 │
> │   8 │
> │   9 │
> │  10 │
> │  11 │
> │  12 │
> │  13 │
> │  14 │
> └─────┘
> (14 rows)
>
> So far so good. Now session 2 continues:
>
> pg@regression:5432 [1444078]=# delete from snapshot_bug where col < 15;
> DELETE 14

I got a bit confused here - you seem to have switched sessions 1 and 2
around? Doesn't seem to matter much though, I was able to reproduce this.

This indeed seems to be a separate bug.

The backtrace to the point where the xmin horizon is affected by
TransactionIdLimitedForOldSnapshots() is:

#0  TransactionIdLimitedForOldSnapshots (recentXmin=2082816071, relation=0x7f52ff3b56f8)
    at /home/andres/src/postgresql/src/backend/utils/time/snapmgr.c:1870
#1  0x00005567f4cd1a55 in heap_page_prune_opt (relation=0x7f52ff3b56f8, buffer=175)
    at /home/andres/src/postgresql/src/backend/access/heap/pruneheap.c:106
#2  0x00005567f4cc70e2 in heapam_index_fetch_tuple (scan=0x5567f6db3028, tid=0x5567f6db2e40,
    snapshot=0x5567f6d67d68, slot=0x5567f6db1b60, call_again=0x5567f6db2e46, all_dead=0x7ffce13d78de)
    at /home/andres/src/postgresql/src/backend/access/heap/heapam_handler.c:137
#3  0x00005567f4cdf5e6 in table_index_fetch_tuple (scan=0x5567f6db3028, tid=0x5567f6db2e40,
    snapshot=0x5567f6d67d68, slot=0x5567f6db1b60, call_again=0x5567f6db2e46, all_dead=0x7ffce13d78de)
    at /home/andres/src/postgresql/src/include/access/tableam.h:1020
#4  0x00005567f4ce0767 in index_fetch_heap (scan=0x5567f6db2de0, slot=0x5567f6db1b60)
    at /home/andres/src/postgresql/src/backend/access/index/indexam.c:577
#5  0x00005567f4f19191 in IndexOnlyNext (node=0x5567f6db16a0)
    at /home/andres/src/postgresql/src/backend/executor/nodeIndexonlyscan.c:169
#6  0x00005567f4ef4bc4 in ExecScanFetch (node=0x5567f6db16a0, accessMtd=0x5567f4f18f20 <IndexOnlyNext>,
    recheckMtd=0x5567f4f1951c <IndexOnlyRecheck>)
    at /home/andres/src/postgresql/src/backend/executor/execScan.c:133
#7  0x00005567f4ef4c39 in ExecScan (node=0x5567f6db16a0, accessMtd=0x5567f4f18f20 <IndexOnlyNext>,
    recheckMtd=0x5567f4f1951c <IndexOnlyRecheck>)
    at /home/andres/src/postgresql/src/backend/executor/execScan.c:182
#8  0x00005567f4f195d4 in ExecIndexOnlyScan (pstate=0x5567f6db16a0)
    at /home/andres/src/postgresql/src/backend/executor/nodeIndexonlyscan.c:317
#9  0x00005567f4ef0f71 in ExecProcNodeFirst (node=0x5567f6db16a0)
    at /home/andres/src/postgresql/src/backend/executor/execProcnode.c:444
#10 0x00005567f4f1d694 in ExecProcNode (node=0x5567f6db16a0)
    at /home/andres/src/postgresql/src/include/executor/executor.h:245
#11 0x00005567f4f1d7d2 in ExecLimit (pstate=0x5567f6db14b8)
    at /home/andres/src/postgresql/src/backend/executor/nodeLimit.c:95
#12 0x00005567f4ef0f71 in ExecProcNodeFirst (node=0x5567f6db14b8)
    at /home/andres/src/postgresql/src/backend/executor/execProcnode.c:444
#13 0x00005567f4ee57c3 in ExecProcNode (node=0x5567f6db14b8)
    at /home/andres/src/postgresql/src/include/executor/executor.h:245
#14 0x00005567f4ee83dd in ExecutePlan (estate=0x5567f6db1280, planstate=0x5567f6db14b8,
    use_parallel_mode=false, operation=CMD_SELECT, sendTuples=true, numberTuples=0,
    direction=ForwardScanDirection, dest=0x5567f6db3c78, execute_once=true)
    at /home/andres/src/postgresql/src/backend/executor/execMain.c:1646
#15 0x00005567f4ee5e23 in standard_ExecutorRun (queryDesc=0x5567f6d0c490, direction=ForwardScanDirection,
    count=0, execute_once=true)
    at /home/andres/src/postgresql/src/backend/executor/execMain.c:364
#16 0x00005567f4ee5c35 in ExecutorRun (queryDesc=0x5567f6d0c490, direction=ForwardScanDirection,
    count=0, execute_once=true)
    at /home/andres/src/postgresql/src/backend/executor/execMain.c:308
#17 0x00005567f510c4de in PortalRunSelect (portal=0x5567f6d49260, forward=true, count=0, dest=0x5567f6db3c78)
    at /home/andres/src/postgresql/src/backend/tcop/pquery.c:912
#18 0x00005567f510c191 in PortalRun (portal=0x5567f6d49260, count=9223372036854775807, isTopLevel=true,
    run_once=true, dest=0x5567f6db3c78, altdest=0x5567f6db3c78, qc=0x7ffce13d7de0)
    at /home/andres/src/postgresql/src/backend/tcop/pquery.c:756
#19 0x00005567f5106015 in exec_simple_query (
    query_string=0x5567f6cdd7a0 "select col from snapshot_bug where col >= 0 order by col limit 14;")
    at /home/andres/src/postgresql/src/backend/tcop/postgres.c:1239

which in my tree is the elog() in the block below:

    if (!same_ts_as_threshold)
    {
        if (ts == update_ts)
        {
            PrintOldSnapshotMapping("non cached limit via update_ts", false);

            xlimit = latest_xmin;
            if (NormalTransactionIdFollows(xlimit, recentXmin))
            {
                elog(LOG, "increasing threshold from %u to %u (via update_ts)",
                     recentXmin, xlimit);
                SetOldSnapshotThresholdTimestamp(ts, xlimit);
            }
        }

the mapping at that point is:

2020-04-01 16:14:00.025 PDT [1272381][4/2:0] WARNING:  old snapshot mapping at "non cached limit via update_ts" with head ts: 1, current entries: 2 max entries: 11, offset: 0
  entry 0 (ring 0): min 1: xid 2082816067
  entry 1 (ring 1): min 2: xid 2082816071

and the xmin change is:

2020-04-01 16:14:00.026 PDT [1272381][4/2:0] LOG:  increasing threshold from 2082816071 to 2082816072 (via update_ts)

in the frame of heap_page_prune_opt():

(rr) p snapshot->whenTaken
$5 = 639097973135655
(rr) p snapshot->lsn
$6 = 133951784192
(rr) p MyPgXact->xmin
$7 = 2082816071
(rr) p BufferGetBlockNumber(buffer)
$11 = 0

in the frame for TransactionIdLimitedForOldSnapshots():

(rr) p ts
$8 = 639098040000000
(rr) p latest_xmin
$9 = 2082816072
(rr) p update_ts
$10 = 639098040000000

The primary issue here is that there is no TestForOldSnapshot() check in
heap_hot_search_buffer(). Therefore index fetches will never even try to
detect that tuples they need have actually already been pruned away.
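To illustrate the shape of such a check (just an untested sketch, not a
proposed patch): the existing TestForOldSnapshot() helper from bufmgr.h
could be run against the page right after it is obtained in
heap_hot_search_buffer(). It raises the "snapshot too old" error when
old_snapshot_threshold is enabled, the page LSN is newer than the
snapshot's LSN, and the snapshot has crossed the threshold, which is how
the non-index paths notice early pruning today:

    /* access/heap/heapam.c -- sketch only, untested */
    #include "postgres.h"

    #include "access/heapam.h"
    #include "storage/bufmgr.h"

    bool
    heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
                           Snapshot snapshot, HeapTuple heapTuple,
                           bool *all_dead, bool first_call)
    {
        Page        dp = (Page) BufferGetPage(buffer);

        /*
         * Hypothetical addition: error out with "snapshot too old" instead of
         * silently failing to find tuples that early pruning already removed.
         */
        TestForOldSnapshot(snapshot, relation, dp);

        /* ... rest of the existing HOT chain traversal unchanged ... */
    }

Whether this is the right layer for it, rather than e.g.
heapam_index_fetch_tuple(), I haven't thought through; the point is just
that some such check has to run on the index fetch path.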
The wrong query answers I saw took longer to reproduce, so I've not been
able to debug the precise reasons yet.

Greetings,

Andres Freund