Re: Doing better at HINTing an appropriate column within errorMissingColumn() - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: Doing better at HINTing an appropriate column within errorMissingColumn() |
Date | |
Msg-id | CAM3SWZRx=rc+53vYsgGnAShU69unjLmqTQtPqe+4jKjkRCeeNw@mail.gmail.com Whole thread Raw |
In response to | Re: Doing better at HINTing an appropriate column within errorMissingColumn() (Robert Haas <robertmhaas@gmail.com>) |
Responses |
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
|
List | pgsql-hackers |
On Mon, Nov 17, 2014 at 10:15 AM, Robert Haas <robertmhaas@gmail.com> wrote: > I'm grumpy about the distanceName() function. That seems too generic. > If we're going to keep this as it is, I suggest something like > computeRTEColumnDistance(). But see below. Fair point. > On a related note, I'm also grumpy about this comment: > > + /* Charge half as much per deletion as per insertion or per substitution */ > + return varstr_levenshtein_less_equal(actual, len, match, match_len, > + 2, 1, 2, max); > > The purpose of a code comment is to articulate WHY we did something, > rather than simply to restate what the code quite obviously does. I > haven't heard a compelling argument for why this should be 2, 1, 2 > rather than the default 1, 1, 1; and I'm inclined to do the latter > unless you can make some very good argument for this combination of > weights. And if you can make such an argument, then there should be > comments so that the next person to come along and look at this code > doesn't go, huh, that's whacky, and change it. Okay. I agree that that deserves a comment. The actual argument for this formulation is that it just seems to work better that way. For example: """ postgres=# \d orderlines Table "public.orderlines" Column | Type | Modifiers -------------+----------+-----------orderlineid | integer | not nullorderid | integer | not nullprod_id | integer | not nullquantity | smallint | not nullorderdate | date | not null postgres=# select qty from orderlines ; ERROR: 42703: column "qty" does not exist LINE 1: select qty from orderlines ; ^ HINT: Perhaps you meant to reference the column "orderlines"."quantity". """ The point is that the fact that the user supplied "qty" string has so many fewer characters than what was obviously intended - "quantity" - deserves to be weighed less. If you change the costing to weigh character deletion as being equal to substitution/addition, this example breaks. I also think it's pretty common to have noise words in every attribute (e.g. every column in the "orderlines" table matches "orderlines_*"), which might otherwise mess things up by overcharging for deletion. Having extra characters in the correctly spelled column name seems legitimately less significant to me. Or, in other words: having actual characters from the misspelling match the correct spelling (and having actual characters given not fail to match) is most important. What was given by the user is more important than what was not given but should have been, which is not generally true for uses of Levenshtein distance. I reached this conclusion through trying out the patch with a couple of real schemas, and seeing what works best. It's hard to express that idea tersely, in a comment, but I guess I'll try. > + int location, FuzzyAttrMatchState *rtestate) > > I suggest calling this "fuzzystate" rather than "rtestate"; it's not > the state of the RTE, but the state of the fuzzy matching. The idea here was to differentiate this state from the overall range table state (in general, FuzzyAttrMatchState may be one or the other). But okay. > Within the scanRTEForColumn block, we have a rather large chunk of > code protected by if (rtestate), which contains the only call to > distanceName(). I suggest that we move all of this logic to a > separate, static function, and merge distanceName into it. I also > suggest testing against NULL explicitly instead of implicitly. So > this block of code would end up as something like: > > if (fuzzystate != NULL) > updateFuzzyAttrMatchState(rte, attcolname, colname, &fuzzystate); Okay. > In searchRangeTableForCol, I'm fairly certain that you've changed the > behavior by adding a check for if (rte->rtekind == RTE_JOIN) before > the call to scanRTEForColumn(). Why not instead put this check into > updateFuzzyAttrMatchState? Then you can be sure you're not changing > the behavior in any other case. I thought that I had avoided changing things (beyond what was advertised as changed in relation to this most recent revision) because I also changed things WRT multiple matches per RTE. It's fuzzy. Anyway, yeah, I could do it there instead. > On a similar note, I think the dropped-column test should happen early > as well, probably again in updateFuzzyAttrMatchState(). There's > little point in adding a suggestion only to throw it away again. Agreed. > + /* > + * Charge extra (for inexact matches only) when an alias was > + * specified that differs from what might have been used to > + * correctly qualify this RTE's closest column > + */ > + if (wrongalias) > + rtestate.distance += 3; > > I don't understand what situation this is catering to. Can you > explain? It seems to account for a good deal of complexity. Two cases: 1. Distinguishing between the case where there was an exact match to a column that isn't visible (i.e. the existing reason for errorMissingColumn() to call here), and the case where there is a visible column, but our alias was the wrong one. I guess that could live in errorMissingColumn(), but overall it's more convenient to do it here, so that errorMissingColumn() handles things almost uniformly and doesn't really have to care. 2. For non-exact (fuzzy) matches, it seems more useful to give one match rather than two when the user gave an alias that matches one particular RTE. Consider this: """ postgres=# select ordersid from orders o join orderlines ol on o.orderid = ol.orderid; ERROR: 42703: column "ordersid" does not exist LINE 1: select ordersid from orders o join orderlines ol on o.orderi... ^ HINT: Perhaps you meant to reference the column "o"."orderid" or the column "ol"."orderid". LOCATION: errorMissingColumn, parse_relation.c:3166 postgres=# select ol.ordersid from orders o join orderlines ol on o.orderid = ol.orderid; ERROR: 42703: column ol.ordersid does not exist LINE 1: select ol.ordersid from orders o join orderlines ol on o.ord... ^ HINT: Perhaps you meant to reference the column "ol"."orderid". LOCATION: errorMissingColumn, parse_relation.c:3147 """ One suggestion is better than two if it's evident that that single suggestion is a better fit. And, more broadly, the fact that an alias was given and matches ought to be weighed. > ERROR: column "oid" does not exist > LINE 1: select oid > 0, * from altwithoid; > ^ > +HINT: Perhaps you meant to reference the column "altwithoid"."col". > > That seems like a stretch. I think I suggested before using a > distance threshold of at most 3 or half the word length, whichever is > less. For a three-letter column name that means not suggesting > anything if more than one character is different. What you > implemented here is close to that, yet somehow we've got a suggestion > slipping through that has 2 out of 3 characters different. I'm not > quite sure I see how that's getting through, but I think it shouldn't. It's because I don't apply the test on smaller strings. I felt that it was riskier to apply an absolute quality test when the user-supplied column reference does not exceed 6 characters. Consider my "qty" vs "quantity" example. If I make this change: --- a/src/backend/parser/parse_relation.c +++ b/src/backend/parser/parse_relation.c @@ -929,7 +929,7 @@ searchRangeTableForCol(ParseState *pstate, const char *alias, char *colname, * seen when 6 deletions are required against actual attribute name, or 3 * insertions/substitutions. */ - if (state->distance > 6 && state->distance > strlen(colname) / 2) + if (state->distance > strlen(colname) / 2) { state->rsecond = state->rfirst = NULL; state->second = state->first = InvalidAttrNumber; Then a lot of the examples you complain about are fixed. But the "qty" example is broken. Plus, this happens when the regression tests are run: *** /home/pg/postgresql/src/test/regress/expected/alter_table.out 2014-11-17 11:50:16.476426191 -0800 --- /home/pg/postgresql/src/test/regress/results/alter_table.out 2014-11-17 11:57:40.776410110 -0800 *************** *** 1343,1349 **** ERROR: column "f1" does not exist LINE 1: select f1 from c1; ^ - HINT: Perhaps you meant to reference the column "c1"."f2". drop table p1 cascade; NOTICE: drop cascades to table c1 createtable p1 (f1 int, f2 int); And: *** /home/pg/postgresql/src/test/regress/expected/join.out 2014-11-17 11:50:16.480426191 -0800 --- /home/pg/postgresql/src/test/regress/results/join.out 2014-11-17 11:57:08.916411263 -0800 *************** *** 3452,3458 **** ERROR: column atts.relid does not exist LINE 1: select atts.relid::regclass, s.* from pg_stats s join ^ - HINT: Perhaps you meant to reference the column "atts"."indexrelid". -- -- Test LATERAL -- (So no hint given in either case) > ERROR: column fullname.text does not exist > LINE 1: select fullname.text from fullname; > ^ > +HINT: Perhaps you meant to reference the column "fullname"."last". > Basically the same problem again. I think the distance threshold in > this case should be half the shorter column name, i.e. 0. Well, there is always going to be the most marginal possible case that still gets to see a suggestion. These are non-organic examples from the regression tests. I'm more worried about having the suggestions work well for organic/representative cases than I am about suppressing non-useful suggestions in non-organic/non-representative cases. As I mentioned, the costing is more or less derived by what I found to work well in what I thought to be representative cases. > Your new test cases include no negative test cases; that is, cases > where the machinery declines to suggest a hint because of, say, 3 > equally good possibilities. They probably should have something like > that. I'll think about that. -- Peter Geoghegan
pgsql-hackers by date: