Re: Doing better at HINTing an appropriate column within errorMissingColumn() - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: Doing better at HINTing an appropriate column within errorMissingColumn() |
Date | |
Msg-id | CAM3SWZRhLLdPD0JHQeXQpVVv54CC_j37t8T-oz5f3SxystShHw@mail.gmail.com Whole thread Raw |
In response to | Re: Doing better at HINTing an appropriate column within errorMissingColumn() (Robert Haas <robertmhaas@gmail.com>) |
Responses |
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
|
List | pgsql-hackers |
On Tue, Dec 2, 2014 at 1:11 PM, Robert Haas <robertmhaas@gmail.com> wrote: > Basically, the case in which I think it's helpful to issue a > suggestion here is when the user has used the table name rather than > the alias name. I wonder if it's worth checking for that case > specifically, in lieu of what you've done here, and issuing a totally > different hint in that case ("HINT: You must refer to this as column > as "prime_minister.id" rather than "cameron.id"). Well, if an alias is used, and you refer to an attribute using a non-alias name (i.e. the original table name), then you'll already get an error suggesting that the alias be used instead -- of course, that's nothing new. It doesn't matter to the existing hinting mechanism if the attribute name is otherwise wrong. Once you fix the code to use the alias suggested, you'll then get this new Levenshtein-based hint. > Another idea, which I think I like less well, is to check the > Levenshtein distance between the allowed alias and the entered alias > and, if that's within the half-the-shorter-length threshold, consider > possible matches from that RTE, charge the distance between the > correct alias and the entered alias as a penalty to each potential > column match. I don't about that either. Aliases are often totally arbitrary, particularly for ad-hoc queries, which is what this is aimed at. > What I think won't do is to look at a situation where the user has > entered automobile.id and suggest that maybe they meant student.iq, or > even student.id. I'm not sure I follow. If there is an automobile.ip, then it will be suggested. If there is no automobile column that's much of a match (so no "automobile.ip", say), then student.id will be suggested (and not student.iq, *even if there is no student.id* - the final quality check saves us). So this is possible: postgres=# select iq, * from student, automobile; ERROR: 42703: column "iq" does not exist LINE 1: select iq, * from student, automobile; ^ HINT: Perhaps you meant to reference the column "student"."id". postgres=# select automobile.iq, * from student, automobile; ERROR: 42703: column automobile.iq does not exist LINE 1: select automobile.iq, * from student, automobile; ^ (note that using the table name makes us *not* see a suggestion where we otherwise would). The point is that there is a fixed penalty for a wrong user-specified alias, but all relation RTEs are considered. > The amount of difference between the names has got to > matter for the RTE names, just as it does for the column names. I think it makes sense that it matters by a fixed amount. Besides, this seems complicated enough already - I don't won't to add more complexity to worry about equidistant (but still actually valid) RTE/table/alias names. It sounds like your concern here is mostly a concern about the relative distance among multiple matches, as opposed to the absolute quality of suggestions. The former seems a lot less controversial than the latter was, though - the user always gets the best match, or the join pair of best matches, or no match when this new hinting mechanism is involved. I attach a new revision. The revision: * Uses default costs for Levenshtein distance. * Still charges extra for a non-alias-matching match (although it only charges a fixed distance of 1 extra). This has regression test coverage. * Applies a generic final quality check that enforces a requirement that a hint have a distance of no greater than 50% of the total string size. No special treatment of shorter strings is involved anymore. * Moves almost everything out of scanRTEForColumn() as you outlined (into a new function, updateFuzzyAttrMatchState(), per your suggestion). * Moves dropped column detection into updateFuzzyAttrMatchState(), per your suggestion. * Still does the "if (rte->rtekind == RTE_JOIN)" thing in the existing function searchRangeTableForCol(). I am quite confident that a suggestion from a join RTE will never be useful, to either the existing use of searchRangeTableForCol() or this expanded use, and it makes more sense to me to put it there. In fact, the existing use of searchRangeTableForCol() is really rather similar to this, and will give up on the first identical match (which is taken as evidence that there is a attribute of that name, but isn't visible at this level of the query). So I have not followed your suggestion here. Thoughts? -- Peter Geoghegan
Attachment
pgsql-hackers by date: