Thread: Searching data across tables, some large
Thanks to Richard for the help earlier this week with performance questions, once I was able to get pgsql and mssql using the same resources and doing some tuning, I was able to get comparable results. The issue still though, I have this view that I designed with the thought in mind to provide all fields the user will want to search in the PHP web app. However, I have found when joining a couple of large tables, the view count reaches the 40+ million records, this does not seem to work in pgsql (or in mssql), especially if distinct is needed. Should I even be attempting this approach? I have never used TSearch and wonder if that is the solution to this type of search? From a quick read of some TSearch info I see indexes are setup on a column basis in a table? So, I could create an index column for say the first name and last name fields in a contact table, but cannot create a field with information from different tables? Also, not all fields are text that I need to search, some boolean for instance, can I search TSearch index fields and other fields at the same time. Not sure how TSearch works, just wanted to get an opinion that that may be what I need before delving into it too much. -- Robert
Yeah, we've used Tsearch with joins and searches on other fields on the tsearch table no problem. Tsearches are just another part of a WHERE clause. Regards - Ericson Smith Developer http://www.funadvice.com On 5/23/07, Robert Fitzpatrick <lists@webtent.net> wrote: > Thanks to Richard for the help earlier this week with performance > questions, once I was able to get pgsql and mssql using the same > resources and doing some tuning, I was able to get comparable results. > > The issue still though, I have this view that I designed with the > thought in mind to provide all fields the user will want to search in > the PHP web app. However, I have found when joining a couple of large > tables, the view count reaches the 40+ million records, this does not > seem to work in pgsql (or in mssql), especially if distinct is needed. > Should I even be attempting this approach? > > I have never used TSearch and wonder if that is the solution to this > type of search? From a quick read of some TSearch info I see indexes are > setup on a column basis in a table? So, I could create an index column > for say the first name and last name fields in a contact table, but > cannot create a field with information from different tables? Also, not > all fields are text that I need to search, some boolean for instance, > can I search TSearch index fields and other fields at the same time. Not > sure how TSearch works, just wanted to get an opinion that that may be > what I need before delving into it too much. > > -- > Robert > > > ---------------------------(end of broadcast)--------------------------- > TIP 3: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faq >
Greetings all, I have a couple issues regarding geographic names databases. 1) The first is this. I have 3 tables. Country, state and city. Country has a country_id to identify a country, state has a state_id and country_id to identify a state, and city has a city_id, state_id and country_id (for easy reference) to identify it. I then have a table for users that stores their city, state and country ID's along with other info about them. My problem came recently when I questioned the integrity of the data and needed to make some changes. I thought to myself that maybe storing the ID wasn't as good as storing the ISO or FIPS 2 letter abbreviation. The only problem the abbreviation could changed at some point by the regulating bodies and all rows in all tables would need to be updated. The question is, for the purposes of querying or searching is it better to store and search a 2 byte integer that is indexed for country or state ID's, or is it better to store and search a 2 byte CHAR abbreviation? 2) I've spent an accumulated total of around a month and a half trying to consolidate geographic name data from several free sources on the net and realize this isn't the best use of my time and errors will be had. Does anyone know of a reliable source of geo data that isn't costly? Most want to charge a server license, annual rate, etc. I'm not sure about the free sources because one I used actually had mixed values in a column and drove me nuts. I primarily need: country state county if applicable city latitude longitude This is primarily input from an HTML form to calculate distances between users. Anyone who has any experience with geo name data I would appreciate hearing your solution.
On Wed, 2007-05-23 at 18:53 -0400, Ericson Smith wrote: > Yeah, we've used Tsearch with joins and searches on other fields on > the tsearch table no problem. Tsearches are just another part of a > WHERE clause. > And can there be Tsearch fields built based on fields in different tables? Where can I find the best docs for TSearch2? I'm looking for information pertaining to how TSearch can help in my situation, instead of building a view with a lot of joins causing absurd amount of rows returned. Or am I going to need to limit the tables a user can search at one time regardless? For instance, seems if I join my table of clients with contacts to return all the contacts with company info, no problem. I can also join my clients with activity *or* comments table. But if I try to left join all three of those tables in one view against clients, seems to return way too many rows. Of course, I will not want all those rows, but I can't get it to respond to queries. Meanwhile, I can create a view with just the comments table joined to clients, while the count is over 1 million records, select statements execute quickly. Is TSearch for me? Thanks for the advice! -- Robert
seems hard to enforce integrity in your model. how are you going to ensure that the user's city-state-country combo a valid one? (well, you can, but it is a pain). ask yourself: can a city be in more than one country? probably not (even if the name is the same it is not the same city!). can a state be in more than one country? etc., etc. seems much cleaner to have cities have a key to states, states to countries. otherwise might as well just have a big denormalized table and skip the whole relational thing... numeric ids vs chars, when properly indexed, should perform about the same (even if there is a small difference this is not something one should really worry about; hey, there aren't even that many cities in the world!) i would go with a unique internal id (in fact that IS what i do) you can store the FIPS/ISO code in a neighboring field, but i am not sure it is good enough for a primary key. > -----Original Message----- > From: pgsql-general-owner@postgresql.org > [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Chuck D. > Sent: Wednesday, May 23, 2007 4:22 PM > To: pgsql-general@postgresql.org > Subject: [GENERAL] Geographic data sources, queries and questions > > Greetings all, > > I have a couple issues regarding geographic names databases. > > 1) The first is this. I have 3 tables. Country, state and > city. Country has > a country_id to identify a country, state has a state_id and > country_id to > identify a state, and city has a city_id, state_id and > country_id (for easy > reference) to identify it. I then have a table for users > that stores their > city, state and country ID's along with other info about them. > > My problem came recently when I questioned the integrity of > the data and > needed to make some changes. I thought to myself that maybe > storing the ID > wasn't as good as storing the ISO or FIPS 2 letter > abbreviation. The only > problem the abbreviation could changed at some point by the > regulating bodies > and all rows in all tables would need to be updated. > > The question is, for the purposes of querying or searching is > it better to > store and search a 2 byte integer that is indexed for country > or state ID's, > or is it better to store and search a 2 byte CHAR abbreviation? > > 2) I've spent an accumulated total of around a month and a > half trying to > consolidate geographic name data from several free sources on > the net and > realize this isn't the best use of my time and errors will be > had. Does > anyone know of a reliable source of geo data that isn't > costly? Most want to > charge a server license, annual rate, etc. I'm not sure > about the free > sources because one I used actually had mixed values in a > column and drove me > nuts. I primarily need: > > country > state > county if applicable > city > latitude > longitude > > This is primarily input from an HTML form to calculate > distances between > users. > > Anyone who has any experience with geo name data I would > appreciate hearing > your solution.
"Chuck D." <pgsql-list@nullmx.com> writes: > 1) The first is this. I have 3 tables. Country, state and city. Country has > a country_id to identify a country, state has a state_id and country_id to > identify a state, and city has a city_id, state_id and country_id (for easy > reference) to identify it. I then have a table for users that stores their > city, state and country ID's along with other info about them. I don't believe this is good design. You'll have to have a trigger or something to verify that the country_id+state_id on the city table are exactly equal to the country_id+state_id on the state table. If you don't, you might have something like (using US city names...) "country: USA -> state: NY" -> "country: Zimbabwe -> state: NY -> city: New York". It isn't a problem of "any country and any state" on the city table, but a problem of "this state inside that particular country". I'd drop the country column. > My problem came recently when I questioned the integrity of the data and > needed to make some changes. I thought to myself that maybe storing the ID > wasn't as good as storing the ISO or FIPS 2 letter abbreviation. The only > problem the abbreviation could changed at some point by the regulating bodies > and all rows in all tables would need to be updated. You have integrity problems because you denormalized your model too much and tried to attach the same information on two different places without requiring those to be equal. > The question is, for the purposes of querying or searching is it better to > store and search a 2 byte integer that is indexed for country or state ID's, > or is it better to store and search a 2 byte CHAR abbreviation? It all depends: surrogate primary keys or ... :-) (old flame starter) -- Jorge Godoy <jgodoy@gmail.com>
On Wed, 2007-05-23 at 19:48 -0400, Robert Fitzpatrick wrote: > On Wed, 2007-05-23 at 18:53 -0400, Ericson Smith wrote: > > Yeah, we've used Tsearch with joins and searches on other fields on > > the tsearch table no problem. Tsearches are just another part of a > > WHERE clause. > > > > And can there be Tsearch fields built based on fields in different > tables? Where can I find the best docs for TSearch2? I'm looking for > information pertaining to how TSearch can help in my situation, instead > of building a view with a lot of joins causing absurd amount of rows > returned. Of course, what was I thinking! After reading through the TSearch stuff I see the objective is to index multiple columns. There would be no way to reference a unique record from an index of columns over multiple tables. So, it is looking like a build of the query string is going to have to take place on whichever tables the fields are in while limiting what fields the user can search (depending whether fields are in different tables), that versus a view with everything is what I should be doing? -- Robert
On Wednesday 23 May 2007 17:56, you wrote: > seems hard to enforce integrity in your model. how are you going to > ensure that the user's city-state-country combo a valid one? (well, you > can, but it is a pain). ask yourself: can a city be in more than one > country? probably not (even if the name is the same it is not the same > city!). can a state be in more than one country? etc., etc. No, a city state combination belongs to one country as far as I know. I really just left the country id in the city table as an easy means of getting around while I was trying to understand the different data sources. I was doing a lot of typing at the console and it was easier than joins :) It would be dropped for sure as it adds another 4 MEGS to the table not including indexes, and it violates BC normal form. > > numeric ids vs chars, when properly indexed, should perform about the > same (even if there is a small difference this is not something one > should really worry about; hey, there aren't even that many cities in > the world!) Ya, about 2 million city entries total. > > i would go with a unique internal id (in fact that IS what i do) you can > store the FIPS/ISO code in a neighboring field, but i am not sure it is > good enough for a primary key. OK, cool. I *prefer* the id as that is what I've been used to, but when I used oracle in school we would have been taught to use the CHAR. I just don't like using the CHAR because some data sets use FIPS and some use ISO and both are bound to change. After all those are just abbreviations. Seems like something to be referenced. I think I'll stick with the ID's instead of CHARs then as they seem more comfortable. Speaking of which, if you are using a similar database what source did you use for geographic data? I'm having troubles with a reliable set. > > > -----Original Message----- > > From: pgsql-general-owner@postgresql.org > > [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Chuck D. > > Sent: Wednesday, May 23, 2007 4:22 PM > > To: pgsql-general@postgresql.org > > Subject: [GENERAL] Geographic data sources, queries and questions > > > > Greetings all, > > > > I have a couple issues regarding geographic names databases. > > > > 1) The first is this. I have 3 tables. Country, state and > > city. Country has > > a country_id to identify a country, state has a state_id and > > country_id to > > identify a state, and city has a city_id, state_id and > > country_id (for easy > > reference) to identify it. I then have a table for users > > that stores their > > city, state and country ID's along with other info about them. > > > > My problem came recently when I questioned the integrity of > > the data and > > needed to make some changes. I thought to myself that maybe > > storing the ID > > wasn't as good as storing the ISO or FIPS 2 letter > > abbreviation. The only > > problem the abbreviation could changed at some point by the > > regulating bodies > > and all rows in all tables would need to be updated. > > > > The question is, for the purposes of querying or searching is > > it better to > > store and search a 2 byte integer that is indexed for country > > or state ID's, > > or is it better to store and search a 2 byte CHAR abbreviation? > > > > 2) I've spent an accumulated total of around a month and a > > half trying to > > consolidate geographic name data from several free sources on > > the net and > > realize this isn't the best use of my time and errors will be > > had. Does > > anyone know of a reliable source of geo data that isn't > > costly? Most want to > > charge a server license, annual rate, etc. I'm not sure > > about the free > > sources because one I used actually had mixed values in a > > column and drove me > > nuts. I primarily need: > > > > country > > state > > county if applicable > > city > > latitude > > longitude > > > > This is primarily input from an HTML form to calculate > > distances between > > users. > > > > Anyone who has any experience with geo name data I would > > appreciate hearing > > your solution.
On Wednesday 23 May 2007 18:59, you wrote: > > I don't believe this is good design. You'll have to have a trigger or > something to verify that the country_id+state_id on the city table are > exactly equal to the country_id+state_id on the state table. If you > don't, you might have something like (using US city names...) "country: > USA -> state: NY" -> "country: Zimbabwe -> state: NY -> city: New > York". > > It isn't a problem of "any country and any state" on the city table, but > a problem of "this state inside that particular country". I'd drop the > country column. You are right, this is a bad design. The country_id on the city table has to go.
* Chuck D. <pgsql-list@nullmx.com> [20070524 01:26]: > 2) I've spent an accumulated total of around a month and a half trying to > consolidate geographic name data from several free sources on the net and > realize this isn't the best use of my time and errors will be had. Does > anyone know of a reliable source of geo data that isn't costly? Most want to > charge a server license, annual rate, etc. I'm not sure about the free > sources because one I used actually had mixed values in a column and drove me > nuts. I primarily need: > > country > state > county if applicable > city > latitude > longitude > > This is primarily input from an HTML form to calculate distances between > users. > > Anyone who has any experience with geo name data I would appreciate hearing > your solution. We are using this data which seems to be fairly extensive and accurate, and is free: http://earth-info.nga.mil/gns/html/gis_countryfiles.htm I haven't fully understood the meaning of all the fiels in there however, we're using it only to compute alternative spellings for city and country names and came up with these conditions that seem to return the desired results: For city alternatives: select lower(full_name) as full_name from geo_names gn1 where gn1.ufi in (select ufi from geo_names gn2 where lower(gn2.full_name)=:city and gn2.fc='P') For country alternatives: select lower(full_name) as full_name from geo_names gn1 where gn1.ufi in (select ufi from geo_names gn2 where lower(gn2.full_name)=:country and gn2.fc='A' and gn2.dsg='PCLI') Til
Tilmann Singer wrote: > We are using this data which seems to be fairly extensive and > accurate, and is free: > > http://earth-info.nga.mil/gns/html/gis_countryfiles.htm We use that, but it is only non-US, so we combine it with this: http://geonames.usgs.gov/domestic/download_data.htm We also have a hodge-podge of other sources, but those are the main ones. (By the way, we have found USGS to very amenable to dumping their data in arbitrary ways. Those state files essentially try to fit everything into a single CSV format, but they have given us other custom dumps.) Note that both of these main sources have multiple names for the same location, so our schema is highly normalized - we have a separate table for names (so the string "Springfield" occurs in only one place :). Because we are interested in all sorts of geographic entities, not just city/state/country, we have only a single table for these, with fields for type, lat/long, primary name, and a few other things. All other relationships are represented in separate linking tables, using our internal IDs for locations and names, e.g., location_has_name, location_contained_in_location, etc. As far as FIPS and ISO codes are concerned, we have a separate table mapping (locationID, standards body) to codes. We are interested in sharing this stuff, so I'd be happy to pass along the schema and/or the data, although all of it is kind of beta. - John D. Burger MITRE
On Thursday 24 May 2007 13:02, John D. Burger wrote: > > We also have a hodge-podge of other sources, but those are the main > ones. (By the way, we have found USGS to very amenable to dumping > their data in arbitrary ways. Those state files essentially try to > fit everything into a single CSV format, but they have given us other > custom dumps.) > > Note that both of these main sources have multiple names for the same > location, so our schema is highly normalized - we have a separate > table for names (so the string "Springfield" occurs in only one > place :). Because we are interested in all sorts of geographic > entities, not just city/state/country, we have only a single table > for these, with fields for type, lat/long, primary name, and a few > other things. All other relationships are represented in separate > linking tables, using our internal IDs for locations and names, e.g., > location_has_name, location_contained_in_location, etc. As far as > FIPS and ISO codes are concerned, we have a separate table mapping > (locationID, standards body) to codes. > > We are interested in sharing this stuff, so I'd be happy to pass > along the schema and/or the data, although all of it is kind of beta. John, I'd be happy to take a look at the schema and data. I decided to put together the USGS stuff, the maxmind free stuff and the GeoNames project files and in the end I had countries with no states, states with no cities and cities with no states. Some data sources said a country had 40 states, another said it had 50. It was difficult to try and figure out because I don't know geo stuff enough to verify it. How can I go about getting a peek at the schema and the sources at least?
Chuck D. wrote: > I decided to put together the USGS stuff, the maxmind free stuff > and the > GeoNames project files and in the end I had countries with no > states, states > with no cities and cities with no states. Some data sources said a > country > had 40 states, another said it had 50. It was difficult to try and > figure > out because I don't know geo stuff enough to verify it. Yeah, all of our sources data has various degrees of noise. There were even locations mis-typed as =countries= in the official NGA downloads - you'd think their validation would at least identify spurious countries :). We developed a set of heuristics for deciding when two locations (usually but not always from two different sources) were in fact the same entity. This was an area that needed more work, however, when the project ended. In addition, different sources had made different ontological decisions about what was what. For instance, does the US have 50 states - what about the US Virgin Islands, etc? This was a few years ago - if we were to start up again, I suspect we would investigate working with whoever is behind geonames.org, as they seem to have the same kind of goals we did. Anyway, I will send our schema under separate cover, and I will investigate sending you the data as well. - John D. Burger MITRE
John D. Burger wrote: > For instance, does the US have 50 states - what about the US Virgin Islands, etc? Off-topic, but the US Virgin Islands are an "unincorporated United States insular area" ("territory"--note the lowercase t). The Dept. Of the Interrior addresses this in their FAQ: http://www.doi.gov/oia/FAQ/FAQindex.htm#4 You'll be surprised and amazed at the number of US "terms of sovereignty" that exist. b
> Off-topic, but the US Virgin Islands are an "unincorporated United > States insular area" ("territory"--note the lowercase t). The Dept. > Of the Interrior addresses this in their FAQ: > > http://www.doi.gov/oia/FAQ/FAQindex.htm#4 > > You'll be surprised and amazed at the number of US "terms of > sovereignty" that exist. Yah, that's my point - some data sources might lump all these together as state/province level entities, and some might not. - John D. Burger MITRE
You can try the free sample database from http://www.geodatasource.com
On May 24, 2007 01:02:42 pm John D. Burger wrote: > Tilmann Singer wrote: > > We are using this data which seems to be fairly extensive and > > accurate, and is free: > > > > http://earth-info.nga.mil/gns/html/gis_countryfiles.htm > > We use that, but it is only non-US, so we combine it with this: > > http://geonames.usgs.gov/domestic/download_data.htm > > We also have a hodge-podge of other sources, but those are the main > ones. (By the way, we have found USGS to very amenable to dumping > their data in arbitrary ways. Those state files essentially try to > fit everything into a single CSV format, but they have given us other > custom dumps.) > > Note that both of these main sources have multiple names for the same > location, so our schema is highly normalized - we have a separate > table for names (so the string "Springfield" occurs in only one > place :). Because we are interested in all sorts of geographic > entities, not just city/state/country, we have only a single table > for these, with fields for type, lat/long, primary name, and a few > other things. All other relationships are represented in separate > linking tables, using our internal IDs for locations and names, e.g., > location_has_name, location_contained_in_location, etc. As far as > FIPS and ISO codes are concerned, we have a separate table mapping > (locationID, standards body) to codes. > > We are interested in sharing this stuff, so I'd be happy to pass > along the schema and/or the data, although all of it is kind of beta. > > - John D. Burger > MITRE This has been a while since I've written discussing this but I am looking for some help. Has anyone successfully imported any of the newer http://earth-info.nga.mil/gns/html/gis_countryfiles.htm world cities files? I keep getting problems with the import using COPY. Just when I thought I'd solved the problem of the ^M$ in some fields (notably China's adm2's) I've come across a problem with cc1=SG in which there appears to be extra ^I tabs. Anyone have success with these? What preprocessing needs to be done to these files to get them to COPY correctly?