Re: INSERT...ON DUPLICATE KEY IGNORE - Mailing list pgsql-hackers
From | Andres Freund
---|---
Subject | Re: INSERT...ON DUPLICATE KEY IGNORE
Date |
Msg-id | 20130831183426.GA6989@awork2.anarazel.de
In response to | Re: INSERT...ON DUPLICATE KEY IGNORE (Peter Geoghegan <pg@heroku.com>)
Responses | Re: INSERT...ON DUPLICATE KEY IGNORE
 | Re: INSERT...ON DUPLICATE KEY IGNORE
List | pgsql-hackers
Hi,

On 2013-08-30 19:38:24 -0700, Peter Geoghegan wrote:
> On Fri, Aug 30, 2013 at 5:47 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > While I ponder on it some more, could you explain whether I understood
> > something correctly? Consider the following scenario:
> >
> > CREATE TABLE foo(a int UNIQUE, b int UNIQUE);
> >
> > S1: INSERT INTO foo VALUES(0, 0);
> > S1: BEGIN;
> > S1: INSERT INTO foo VALUES(1, 1);
> > S1: SELECT pg_sleep(3600);
> > S2: BEGIN;
> > S2: INSERT INTO foo VALUES(2, 1);
> > S3: SELECT * FROM foo WHERE a = 0;
> >
> > Won't S2 in this case exclusively lock a page in foo_a_key (the one
> > that 2 will reside on) for 3600s, while it's waiting for S1 to finish
> > while getting the speculative insertion into foo_b_key?
>
> Yes. The way the patch currently handles that case is unacceptable. It
> needs to be more concerned about holding an exclusive lock on
> earlier-locked indexes when locking the second and subsequent indexes
> iff it blocks on another transaction in the course of locking the
> second and subsequent indexes.

After some thinking I don't think any solution primarily based on
holding page level locks across other index operations is going to
scale well.

What I had in mind when playing around with this is the following
trick: just as in your patch, split index insertion into two phases and
perform one of them before inserting the heap tuple. In what you called
"speculative insertion", don't stop at locking the page, but actually
insert an index tuple into the index. As we haven't yet inserted the
heap tuple we obviously can't insert an actual tuple. Instead set a
flag on the item and reuse the item pointer to store the xid of the
inserting process. I coined those items "insertion promises"...

Index searches will ignore promises they see; concurrent index
insertions will notice it's a promise and wait for the xid to complete
(or not, if it has aborted). That fits pretty well into the existing
btree code.

If the insertion of an index promise worked for all unique indexes and
the heap tuple has been inserted, convert those promises into actual
item pointers pointing to the heap tuple. That probably requires a
second search and some cuteness to find the correct pointer because of
concurrent splits, but that's not too bad (possibly you can do away
with that if we continue to hold a pin).

The fact that xids can now be in the index creates some slight
complications, because the index now needs to be vacuumed - but that's
relatively easy to fit into the bulkdelete API, and I don't think the
contortions to the vacuum code will be too bad.

Imo that solution is fairly elegant because it doesn't cause any heap
bloat, and it causes the same amount of index bloat that
"insert-into-heap-first" type solutions cause. It also has pretty good
concurrency behaviour because you never need to retain a page lock for
longer than a normal index insertion. It's certainly a bit more complex
than your solution tho...

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
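PS: Since the above is all prose, here's a toy, single-threaded C model
of the promise life cycle, in case it makes the two phases clearer.
Every name in it is invented for illustration - this is not the actual
nbtree code, which would flag the promise with a spare bit in the index
tuple and wait through the lock manager rather than peeking at a status
array:

/*
 * Toy, single-threaded model of the "insertion promise" life cycle.
 * All names are invented; nothing here is the real nbtree code.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t Xid;
typedef enum { XID_IN_PROGRESS, XID_COMMITTED, XID_ABORTED } XidStatus;

/* One leaf item: a real heap pointer, or a promise holding an xid. */
typedef struct
{
    int  key;
    bool is_promise;            /* the flag set on the item */
    union
    {
        struct { uint32_t blk; uint16_t off; } ctid;
        Xid xid;                /* item pointer bytes reused for the xid */
    } u;
} LeafItem;

#define MAX_ITEMS 64
static LeafItem page[MAX_ITEMS];
static int      nitems;

/* Stand-in for a clog lookup; real code would block on the xid's lock. */
static XidStatus xact_table[16];
static XidStatus xid_status(Xid xid) { return xact_table[xid]; }

typedef enum { INSERT_OK, INSERT_MUST_WAIT, INSERT_UNIQUE_VIOLATION } InsertResult;

/* Phase one, before the heap tuple exists: insert a promise for key. */
static InsertResult
insert_promise(int key, Xid xid)
{
    for (int i = 0; i < nitems; i++)
    {
        if (page[i].key != key)
            continue;
        if (!page[i].is_promise)
            return INSERT_UNIQUE_VIOLATION;     /* real conflicting tuple */
        switch (xid_status(page[i].u.xid))
        {
            case XID_IN_PROGRESS:               /* caller waits on that xid,  */
                return INSERT_MUST_WAIT;        /* holding no page lock, then */
            case XID_COMMITTED:                 /* retries from scratch       */
                return INSERT_UNIQUE_VIOLATION;
            case XID_ABORTED:
                page[i].u.xid = xid;            /* dead promise: take it over */
                return INSERT_OK;
        }
    }
    page[nitems++] = (LeafItem){ .key = key, .is_promise = true, .u.xid = xid };
    return INSERT_OK;
}

/* Phase two, after the heap tuple exists: convert promise to pointer. */
static void
fulfill_promise(int key, Xid xid, uint32_t blk, uint16_t off)
{
    for (int i = 0; i < nitems; i++)
        if (page[i].key == key && page[i].is_promise && page[i].u.xid == xid)
        {
            page[i].is_promise = false;
            page[i].u.ctid.blk = blk;
            page[i].u.ctid.off = off;
            return;
        }
}

int
main(void)
{
    xact_table[1] = XID_IN_PROGRESS;            /* S1 */
    xact_table[2] = XID_IN_PROGRESS;            /* S2 */

    insert_promise(1, 1);                       /* S1: promise for b = 1 */
    printf("S2: %d (1 = must wait)\n", insert_promise(1, 2));

    xact_table[1] = XID_COMMITTED;              /* S1 commits...          */
    fulfill_promise(1, 1, 42, 7);               /* ...after heap insert   */
    printf("S2 retry: %d (2 = unique violation)\n", insert_promise(1, 2));
    return 0;
}

The point the S1/S2 scenario above turns on is visible in
insert_promise(): a backend that runs into somebody else's in-progress
promise is told to wait, but it does so holding no page lock, so
nothing queues up behind it for the 3600s.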