Re: Natural key woe - Mailing list pgsql-general
| From | Robin |
|---|---|
| Subject | Re: Natural key woe |
| Date | |
| Msg-id | BLU436-SMTP170054D0C7C1930D3611C0FE2340@phx.gbl |
| In response to | Natural key woe (Oliver Kohll - Mailing Lists <oliver.lists@agilebase.co.uk>) |
| Responses | Re: Natural key woe |
| List | pgsql-general |
Oliver

I've read your email with interest. I haven't had to deal with this sort of problem in PostgreSQL, but I have frequently dealt with it in a Sybase environment, first encountered about 25 years ago.

I am most curious to know why you didn't use the same sequence for both tables; I must be missing something (see the shared-sequence sketch after the quoted message below). If there is a gotcha, I'd like to know about it, as I can see this being an issue in a load-sharing environment.

Many years ago, before auto-sequencing was common, we set up explicit functions to generate sequence numbers. Whilst this had some performance costs in multi-user systems, it did have benefits in terms of making it easier to restrict the need for row locking to the underlying data table.

Robin St.Clair

On 13/05/2014 10:44, Oliver Kohll - Mailing Lists wrote:
> I'm sure no one else on this list has done anything like this, but here's a cautionary tale.
>
> I wanted to synchronise data in two tables (issue lists) - i.e. whenever a record is added into one, add a similar record into the other. The two tables are similar in format but not exactly the same, so only a subset of fields are copied. Both tables have synthetic primary keys; these can't be used to match data as they are auto-incrementing sequences that might interfere. What I could have done perhaps is get both tables to use the same sequence, but what I actually did is:
>
> * join both tables based on a natural key
> * use that to copy any missing items from table1 to table2
> * truncate table1 and copy all of table2's rows to table1
> * run this routine once an hour
>
> The natural key was based on the creation timestamp (stored on insert) and one of the text fields, called 'subject'.
>
> The problem came when someone entered a record with no subject, leaving it null. When this record had been copied over and was present in both tables, the *next* time the join was done a duplicate was created, because the join didn't see the two rows as matching (null != null).
>
> So after 1 hour there were two records. After two there were four, after 3, 8 etc.
>
> When I logged in after 25 hrs and noticed table access was a little slow, there were 2^25 = 33 million records.
>
> That's a learning experience for me at least. It's lucky I did check it at the end of that day rather than leaving it overnight, otherwise I think our server would have ground to a halt.
>
> One other wrinkle to note. After clearing out these rows, running 'VACUUM table2', 'ANALYZE table2' and 'REINDEX TABLE table2', some queries with simple sequence scans were taking a few seconds to run, even though there are only a thousand rows in the table. I finally found that running CLUSTER on the table sorted that out, even though we're on an SSD, so I would have thought that seeking all over the place for a seq. scan wouldn't have made much difference. It obviously does still make some.
>
> Oliver Kohll
> www.agilebase.co.uk
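To illustrate the shared-sequence approach Robin asks about (and which Oliver mentions as the road not taken): a minimal sketch. Only the table names table1 and table2 come from the thread; the column layout here is invented for illustration.

```sql
-- One sequence feeding the primary keys of both tables, so ids never
-- collide and a row keeps the same id when copied across.
CREATE SEQUENCE issue_id_seq;

CREATE TABLE table1 (
    id      bigint PRIMARY KEY DEFAULT nextval('issue_id_seq'),
    created timestamptz NOT NULL DEFAULT now(),
    subject text
);

CREATE TABLE table2 (
    id      bigint PRIMARY KEY DEFAULT nextval('issue_id_seq'),
    created timestamptz NOT NULL DEFAULT now(),
    subject text
);
```

With ids drawn from a single sequence, the synthetic key itself can be used to match rows between the two tables, and no natural key is needed at all.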
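On the null != null pitfall: PostgreSQL's IS NOT DISTINCT FROM comparison treats two NULLs as matching, unlike the = operator, which yields NULL (not true) when either side is NULL. A sketch of a null-safe copy step, assuming the two-column natural key described in the post:

```sql
-- Copy rows from table1 that have no counterpart in table2, treating
-- NULL subjects (and NULL timestamps) as equal so a subject-less row
-- is matched rather than duplicated on each run.
INSERT INTO table2 (created, subject)
SELECT t1.created, t1.subject
FROM table1 AS t1
WHERE NOT EXISTS (
    SELECT 1
    FROM table2 AS t2
    WHERE t2.created IS NOT DISTINCT FROM t1.created
      AND t2.subject IS NOT DISTINCT FROM t1.subject
);
```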
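On the CLUSTER wrinkle: a likely explanation is table bloat rather than seek times. Plain VACUUM marks the space left by the 33 million deleted rows as reusable but does not shrink the table file, so a sequential scan still reads every mostly-empty page, SSD or not. CLUSTER, like VACUUM FULL, rewrites the table into a compact new file. A sketch of the maintenance sequence (the index name table2_pkey is a guess at the primary key's index):

```sql
-- Reclaims dead tuples and refreshes planner statistics, but leaves
-- the table file at its bloated size.
VACUUM table2;
ANALYZE table2;

-- Rewrites the table ordered by the given index, discarding empty
-- pages; takes an exclusive lock on the table while it runs.
CLUSTER table2 USING table2_pkey;
```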