Re: extracting location info from string - Mailing list pgsql-sql
From | Rob Sargent |
---|---|
Subject | Re: extracting location info from string |
Date | |
Msg-id | 4DDD1F7C.5040008@gmail.com Whole thread Raw |
In response to | Re: extracting location info from string (Lew <noone@lewscanon.com>) |
Responses |
Re: extracting location info from string
|
List | pgsql-sql |
On 05/24/2011 10:57 AM, Lew wrote: > Tarlika Elisabeth Schmitz wrote: >> Lew wrote: >>> That isn't a table structure, that's a freeform text structure. You >>> didn't state your question, Tarlika, but your database structure is >>> terrible. For example, "region" and "country" should be different >>> columns. > >> I presume you are referring to my original post: >> CREATE TABLE person >> ( >> id integer NOT NULL, >> "name" character varying(256) NOT NULL, >> "location" character varying(256), >> CONSTRAINT person_pkey PRIMARY KEY (id) >> ); >> >> Sorry, this was just a TEMPORARY table I created for quick analysis of >> my CSV data (now renamed to temp_person). >> >> >> >> The target table is: >> CREATE TABLE person >> ( >> id integer NOT NULL, >> "name" character varying(100) NOT NULL, >> country character varying(3), >> county character varying(3), >> town character varying(50), >> CONSTRAINT trainer_pkey PRIMARY KEY (id), >> CONSTRAINT country_person_fk FOREIGN KEY (country) >> REFERENCES country (id) MATCH SIMPLE >> ON UPDATE CASCADE ON DELETE RESTRICT, >> CONSTRAINT county_person_fk FOREIGN KEY (country, county) >> REFERENCES county (country, code) MATCH SIMPLE >> ON UPDATE NO ACTION ON DELETE NO ACTION, >> ); > > Ah, yes, that makes much more sense. Temporary tables such as you > describe can be very convenient and effective. Thanks for the > clarification. > > I think this problem is very widespread, namely how to get structured > information out of freeform data. I've encountered it many times over > the years, as have so many I know. I believe that human intervention > will always be needed for this type of work, e.g., distinguishing place > names that seem the same or correlating ones that seem distinct. I also > don't know of any perfect approach. Perhaps the best one can find is a > probabilistic promise that error will be less than some epsilon. > > That said, if you have a robust process to correct errors as the user > population discovers them, then you can approach perfection > asymptotically. Sometimes the best solution to a technical problem is a > good human process. From an engineering standpoint, user feedback is a > vital element of homeostatic control. > > Edward W. Rouse's suggestion of a reference table to resolve different > forms of address or region identification would fit well with such a > process. > To minimize the ultimately quite necessary human adjudication, one might make good use of what is often termed "crowd sourcing": Keep all the distinct "hand entered" values and a map to the final human assessment. At least repeated oddities won't repeatedly involvehand editing.