Re: Out of Memory and Configuration Problems (Big Computer) - Mailing list pgsql-general
From | Tom Wilcox |
---|---|
Subject | Re: Out of Memory and Configuration Problems (Big Computer) |
Date | |
Msg-id | 4C040A9A.8020207@gmail.com Whole thread Raw |
In response to | Re: Out of Memory and Configuration Problems (Big Computer) (Bill Moran <wmoran@potentialtech.com>) |
Responses |
Re: Out of Memory and Configuration Problems (Big Computer)
|
List | pgsql-general |
I am having difficulties. I have rerun my update that uses the python functions.. (1) UPDATE nlpg.match_data SET org = normalise(org); And some other similar queries on neighbouring fields in the table. They have all now worked. Without any changes to the configuration. I have done one thing in an attempt to minimise the risk of memory leak normalise() I added "toks = None" to the end of the normalise() function. However this was done after query (1) succeeded on the rerun. Why would I get inconsistent behaviour? Would it have anything to do with SQL Server running on the same machine (although not actually doing anything at the moment - just idle server running in background). Tangent: Is there any way to increase the memory allocated to postgres by Windows using Job Objects? Cheers, Tom On 29/05/2010 18:55, Bill Moran wrote: > On 5/28/10 8:43:48 PM, Tom Wilcox wrote: >> I ran this query: >> >> EXPLAIN ANALYZE UPDATE nlpg.match_data SET org = org; >> >> And I got this result: >> >> "Seq Scan on match_data (cost=0.00..9762191.68 rows=32205168 width=206) >> (actual time=76873.592..357450.519 rows=27777961 loops=1)" >> "Total runtime: 8028212.367 ms" > > That would seem to indicate that the problem is in your Python > functions. > > Some ideas for next steps: > * Perhaps it's just a few rows that have data in them that the > function has difficulty with. Add some debugging/logging to > the function and see if the row it bombs on has anything unusual > in it (such as a very large text field) > * While large, that function is fairly simplistic. You may want > to consider rewriting it as an SQL function, which should be > more efficient in any event. > >> >> >> On 28 May 2010 19:39, Tom Wilcox <hungrytom@googlemail.com >> <mailto:hungrytom@googlemail.com>> wrote: >> >> Oops. Sorry about that. >> >> I am having this problem with multiple queries however I am >> confident that a fair number may involve the custom plpython >> "normalise" function which I have made myself. I didn't think it >> would be complicated enough to produce a memory problem.. here it >> is: >> >> -- Normalises common address words (i.e. 'Ground' maps to 'grd') >> CREATE OR REPLACE FUNCTION normalise(s text) RETURNS text AS $$ >> ADDR_FIELD_DELIM = ' ' >> >> # Returns distinct list without null or empty elements >> def distinct_str(list): >> seen = set() >> return [x for x in list if x not in seen and not seen.add(x) >> and x!=None and len(x)>0] >> >> # normalise common words in given address string >> def normalise(match_data): >> if match_data==None: return '' >> import re >> # Tokenise >> toks = distinct_str(re.split(r'\s', match_data.lower())) >> out = '' >> for tok in toks: >> ## full word replace >> if tok == 'house' : out += 'hse'+ADDR_FIELD_DELIM >> elif tok == 'ground' : out += 'grd'+ADDR_FIELD_DELIM >> elif tok == 'gnd' : out += 'grd'+ADDR_FIELD_DELIM >> elif tok == 'front' : out += 'fnt'+ADDR_FIELD_DELIM >> elif tok == 'floor' : out += 'flr'+ADDR_FIELD_DELIM >> elif tok == 'floors' : out += 'flr'+ADDR_FIELD_DELIM >> elif tok == 'flrs' : out += 'flr'+ADDR_FIELD_DELIM >> elif tok == 'fl' : out += 'flr'+ADDR_FIELD_DELIM >> elif tok == 'basement' : out += 'bst'+ADDR_FIELD_DELIM >> elif tok == 'subbasement' : out += 'sbst'+ADDR_FIELD_DELIM >> elif tok == 'bsmt' : out += 'bst'+ADDR_FIELD_DELIM >> elif tok == 'lbst' : out += 'lower bst'+ADDR_FIELD_DELIM >> elif tok == 'street' : out += 'st'+ADDR_FIELD_DELIM >> elif tok == 'road' : out += 'rd'+ADDR_FIELD_DELIM >> elif tok == 'lane' : out += 'ln'+ADDR_FIELD_DELIM >> elif tok == 'rooms' : out += 'rm'+ADDR_FIELD_DELIM >> elif tok == 'room' : out += 'rm'+ADDR_FIELD_DELIM >> elif tok == 'no' : pass >> elif tok == 'number' : pass >> elif tok == 'and' : out += '&'+ADDR_FIELD_DELIM >> elif tok == 'rear' : out += 'rr'+ADDR_FIELD_DELIM >> elif tok == 'part' : out += 'pt'+ADDR_FIELD_DELIM >> elif tok == 'south' : out += 's'+ADDR_FIELD_DELIM >> elif tok == 'sth' : out += 's'+ADDR_FIELD_DELIM >> elif tok == 'north' : out += 'n'+ADDR_FIELD_DELIM >> elif tok == 'nth' : out += 'n'+ADDR_FIELD_DELIM >> elif tok == 'west' : out += 'w'+ADDR_FIELD_DELIM >> elif tok == 'wst' : out += 'w'+ADDR_FIELD_DELIM >> elif tok == 'east' : out += 'e'+ADDR_FIELD_DELIM >> elif tok == 'est' : out += 'e'+ADDR_FIELD_DELIM >> elif tok == 'first' : out += '1st'+ADDR_FIELD_DELIM >> elif tok == 'second' : out += '2nd'+ADDR_FIELD_DELIM >> elif tok == 'third' : out += '3rd'+ADDR_FIELD_DELIM >> elif tok == 'fourth' : out += '4th'+ADDR_FIELD_DELIM >> elif tok == 'fifth' : out += '5th'+ADDR_FIELD_DELIM >> elif tok == 'sixth' : out += '6th'+ADDR_FIELD_DELIM >> elif tok == 'seventh' : out += '7th'+ADDR_FIELD_DELIM >> elif tok == 'eighth' : out += '8th'+ADDR_FIELD_DELIM >> elif tok == 'ninth' : out += '9th'+ADDR_FIELD_DELIM >> elif tok == 'tenth' : out += '10th'+ADDR_FIELD_DELIM >> elif tok == 'eleventh' : out += '11th'+ADDR_FIELD_DELIM >> elif tok == 'twelfth' : out += '12th'+ADDR_FIELD_DELIM >> elif tok == 'thirteenth' : out += '13th'+ADDR_FIELD_DELIM >> elif tok == 'fourteenth' : out += '14th'+ADDR_FIELD_DELIM >> elif tok == 'fifteenth' : out += '15th'+ADDR_FIELD_DELIM >> elif tok == 'sixteenth' : out += '16th'+ADDR_FIELD_DELIM >> elif tok == 'seventeenth' : out += '17th'+ADDR_FIELD_DELIM >> elif tok == 'eighteenth' : out += '18th'+ADDR_FIELD_DELIM >> elif tok == 'nineteenth' : out += '19th'+ADDR_FIELD_DELIM >> elif tok == 'twentieth' : out += '20th'+ADDR_FIELD_DELIM >> # numbers 0 - 20 >> elif tok == 'one' : out += '1'+ADDR_FIELD_DELIM >> elif tok == 'two' : out += '2'+ADDR_FIELD_DELIM >> elif tok == 'three' : out += '3'+ADDR_FIELD_DELIM >> elif tok == 'four' : out += '4'+ADDR_FIELD_DELIM >> elif tok == 'five' : out += '5'+ADDR_FIELD_DELIM >> elif tok == 'six' : out += '6'+ADDR_FIELD_DELIM >> elif tok == 'seven' : out += '7'+ADDR_FIELD_DELIM >> elif tok == 'eight' : out += '8'+ADDR_FIELD_DELIM >> elif tok == 'nine' : out += '9'+ADDR_FIELD_DELIM >> elif tok == 'ten' : out += '10'+ADDR_FIELD_DELIM >> elif tok == 'eleven' : out += '11'+ADDR_FIELD_DELIM >> elif tok == 'twelve' : out += '12'+ADDR_FIELD_DELIM >> elif tok == 'thirteen' : out += '13'+ADDR_FIELD_DELIM >> elif tok == 'fourteen' : out += '14'+ADDR_FIELD_DELIM >> elif tok == 'fifteen' : out += '15'+ADDR_FIELD_DELIM >> elif tok == 'sixteen' : out += '16'+ADDR_FIELD_DELIM >> elif tok == 'seventeen' : out += '17'+ADDR_FIELD_DELIM >> elif tok == 'eighteen' : out += '18'+ADDR_FIELD_DELIM >> elif tok == 'nineteen' : out += '19'+ADDR_FIELD_DELIM >> elif tok == 'twenty' : out += '20'+ADDR_FIELD_DELIM >> # town dictionary items >> elif tok == 'borough' : pass >> elif tok == 'city' : pass >> elif tok == 'of' : pass >> elif tok == 'the' : pass >> # a few extras (from looking at voa) >> elif tok == 'at' : pass >> elif tok == 'incl' : pass >> elif tok == 'inc' : pass >> else: out += tok+ADDR_FIELD_DELIM >> return out >> >> return normalise(s) >> $$ LANGUAGE plpythonu; >> >> >> Here's the create script for the table from pgAdmin (I hope that >> will be good enough instead of \d as I can't do that right now).. >> >> -- Table: nlpg.match_data >> >> -- DROP TABLE nlpg.match_data; >> >> CREATE TABLE nlpg.match_data >> ( >> premise_id integer, >> usrn bigint, >> org text, >> sao text, >> "level" text, >> pao text, >> "name" text, >> street text, >> town text, >> pc postcode, >> postcode text, >> match_data_id integer NOT NULL DEFAULT >> nextval('nlpg.match_data_match_data_id_seq1'::regclass), >> addr_str text, >> tssearch_name tsvector, >> >> CONSTRAINT match_data_pkey1 PRIMARY KEY (match_data_id) >> ) >> WITH ( >> OIDS=FALSE >> ); >> ALTER TABLE nlpg.match_data OWNER TO postgres; >> ALTER TABLE nlpg.match_data ALTER COLUMN "name" SET STATISTICS >> 10000; >> >> >> -- Index: nlpg.index_match_data_mid >> >> -- DROP INDEX nlpg.index_match_data_mid; >> >> CREATE INDEX index_match_data_mid >> ON nlpg.match_data >> USING btree >> (match_data_id); >> >> -- Index: nlpg.index_match_data_pc >> >> -- DROP INDEX nlpg.index_match_data_pc; >> >> CREATE INDEX index_match_data_pc >> ON nlpg.match_data >> USING btree >> (pc); >> >> -- Index: nlpg.index_match_data_pid >> >> -- DROP INDEX nlpg.index_match_data_pid; >> >> CREATE INDEX index_match_data_pid >> ON nlpg.match_data >> USING btree >> (premise_id); >> >> -- Index: nlpg.index_match_data_tssearch_name >> >> -- DROP INDEX nlpg.index_match_data_tssearch_name; >> >> CREATE INDEX index_match_data_tssearch_name >> ON nlpg.match_data >> USING gin >> (tssearch_name); >> >> -- Index: nlpg.index_match_data_usrn >> >> -- DROP INDEX nlpg.index_match_data_usrn; >> >> CREATE INDEX index_match_data_usrn >> ON nlpg.match_data >> USING btree >> (usrn); >> >> As you can see, no FKs or triggers.. >> >> I am running: EXPLAIN ANALYZE UPDATE nlpg.match_data SET org = org; >> >> However, as it should take around 90mins (if it is linear) then I >> thought I would send this now and follow up with the results once it >> finishes. (Has taken 2hours so far..) >> >> Thanks very much for your help. >> >> Tom >> >> >> On 28 May 2010 17:54, "Bill Moran" <wmoran@potentialtech.com >> <mailto:wmoran@potentialtech.com>> wrote: >>> >>> In response to Tom Wilcox <hungrytom@googlemail.com >>> <mailto:hungrytom@googlemail.com>>: >>> >>> > In addition, I have discovered that the update query that runs >>> on each row >>> > of a 27million row ta... >>> >>> You're not liable to get shit for answers if you omit the mailing >>> list from >>> the conversation, especially since I know almost nothing about >>> tuning >>> PostgreSQL installed on Windows. >>> >>> Are there multiple queries having this problem? The original >>> query didn't >>> have normalise() in it, and I would be highly suspicious that a >>> custom >>> function may have a memory leak or other memory-intensive >>> side-effects. >>> What is the code for that function? >>> >>> For example, does: >>> UPDATE nlpg.match_data SET org = org WHERE match_data_id; >>> finish in a reasonable amount of time or exhibit the same out of >>> memory >>> problem? >>> >>> It'd be nice to see a \d on that table ... does it have any >>> triggers or >>> cascading foreign keys? >>> >>> And stop >>> >>> -- >>> >>> Bill Moran >>> http://www.potentialtech.com >>> http://people.collaborativefusion.com/~wmoran/ >>> <http://people.collaborativefusion.com/%7Ewmoran/> >>> >> > >
pgsql-general by date: