Thread: Status report: long-query-string changes
I have finished applying Mike Ansley's changes for long queries, along with a bunch of my own. The current status is:

* You can send a query string of indefinite length to the backend. (This is poorly tested for MULTIBYTE, though; would someone who uses MULTIBYTE more than I do try it out?)

* You can get back an EXPLAIN or error message string of indefinite length.

* Single lexical tokens within a query are currently limited to 64k because of the lexer's use of YY_REJECT. I have not committed any of Leon's proposed lexer changes, since that issue still seems controversial. I would like to see us agree on a solution. (ecpg's lexer has the same problem, of course.)

Although I think the backend is in fairly good shape, there are still a few minor trouble spots. (The rule deparser will blow up at 8K, for example --- I have some work to do in there and will fix it when I get a chance.)

In the frontend libraries and clients, both libpq and psql are length-limit-free. I have not looked much at any of the other frontend interface libraries. I suspect that at least odbc and the python interface need work, because quick glimpse searches show suspicious-looking constants:

    MAX_QUERY_SIZE
    ERROR_MSG_LENGTH
    SQL_PACKET_SIZE
    MAX_MESSAGE_LEN
    TEXT_FIELD_SIZE
    MAX_VARCHAR_SIZE
    DRV_VARCHAR_SIZE
    DRV_LONGVARCHAR_SIZE
    MAX_BUFFER_SIZE
    MAX_FIELDS

The real problem in the clients is that pg_dump blithely assumes it will never need to deal with strings over MAX_QUERY_SIZE. This is a bad idea --- it ought to be rewritten to use the expansible-string-buffer facility that now exists in libpq. There may be restrictions in the other programs in bin/ as well, though glimpse didn't turn up any red flags.

I would like to encourage the odbc and python folks to get rid of the length limitations in their modules; I don't use either and have no intention of touching either. I'd like to find a volunteer other than myself to fix pg_dump, too.

Now, all we need is someone to implement multiple-disk-block tuples ;-)

regards, tom lane
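[For illustration, a minimal sketch of how pg_dump could build queries with libpq's expansible string buffers (the PQExpBuffer facility, which appears to be what Tom refers to). The dumpOneTable() helper and the query it builds are made up for this example; only the PQExpBuffer and PQexec calls are existing libpq API.]

#include <stdio.h>
#include "libpq-fe.h"
#include "pqexpbuffer.h"

/* Hypothetical pg_dump-style helper: builds an arbitrarily long query
 * in a PQExpBuffer instead of a fixed MAX_QUERY_SIZE array. */
static void
dumpOneTable(PGconn *conn, const char *tablename)
{
    PQExpBuffer query = createPQExpBuffer();    /* grows as needed */
    PGresult   *res;

    appendPQExpBuffer(query, "SELECT * FROM \"%s\"", tablename);

    res = PQexec(conn, query->data);
    if (PQresultStatus(res) != PGRES_TUPLES_OK)
        fprintf(stderr, "query failed: %s", PQerrorMessage(conn));

    PQclear(res);
    destroyPQExpBuffer(query);                  /* frees buffer and struct */
}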
Tom Lane wrote:
>
> * Single lexical tokens within a query are currently limited to 64k
>   because of the lexer's use of YY_REJECT. I have not committed any
>   of Leon's proposed lexer changes, since that issue still seems
>   controversial. I would like to see us agree on a solution.

Thomas Lockhart should speak up - he seems to be the only person who still has objections. If the proposed change is to be declined, something else has to be applied instead to deal with the lexer's reject feature and its accompanying size limits, as well as the grammar inconsistency. The alternatives all seem to be awkward solutions. As you probably remember, the proposed change only breaks constructs like 1+-2, which anyone in a sane condition should avoid writing anyway :)

There are more size restrictions there. I noticed (by simply eyeing the lexer source, without testing) that in the case of the flex lexer (FLEX_SCANNER being defined in scan.c) the lexer can't swallow big queries. You (Tom and Michael) aren't using flex, are you?

--
Leon.
-------
He knows he'll never have to answer for any of his theories actually
being put to test. If they were, they would be contaminated by reality.
Leon <leon@udmnet.ru> writes:
> There are more size restrictions there. I noticed (by simply eyeing the
> lexer source, without testing) that in the case of the flex lexer
> (FLEX_SCANNER being defined in scan.c) the lexer can't swallow big
> queries. You (Tom and Michael) aren't using flex, are you?

Huh? flex is the only lexer that works with the Postgres .l files, as far as I know. Certainly it's what I'm using.

If you're looking at the "literal" buffer, that would need to be made expansible, but there's not much point until flex's internal stuff is fixed.

regards, tom lane
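[For concreteness, a minimal sketch of what an expansible "literal" buffer could look like. The startlit()/addlit() names, the initial size, and the plain malloc/realloc calls are assumptions for illustration; this is not the code that was, or would be, committed.]

#include <stdlib.h>
#include <string.h>

/* Illustrative replacement for a fixed-size literal[] array:
 * the buffer is grown on demand, so token length is unbounded. */
static char *literalbuf = NULL;     /* accumulated token text */
static int   literallen;            /* bytes used so far */
static int   literalalloc;          /* bytes currently allocated */

static void
startlit(void)
{
    if (literalbuf == NULL)
    {
        literalalloc = 128;
        literalbuf = (char *) malloc(literalalloc);
    }
    literallen = 0;
    literalbuf[0] = '\0';
}

static void
addlit(const char *ytext, int yleng)
{
    /* double the allocation until the new text fits */
    while (literallen + yleng + 1 > literalalloc)
    {
        literalalloc *= 2;
        literalbuf = (char *) realloc(literalbuf, literalalloc);
    }
    memcpy(literalbuf + literallen, ytext, yleng);
    literallen += yleng;
    literalbuf[literallen] = '\0';
}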
Tom Lane wrote:
>
> If you're looking at the "literal" buffer, that would need to be made
> expansible, but there's not much point until flex's internal stuff is
> fixed.

Look at this piece of code. It seems that once myinput() has been called, the second time around it will return 0 even if the string isn't over yet. The 'max' parameter is 8192 bytes on my system, so the query is simply truncated to that size.

#ifdef FLEX_SCANNER
/* input routine for flex to read input from a string instead of a file */
static int
myinput(char *buf, int max)
{
    int len, copylen;

    if (parseCh == NULL)
    {
        len = strlen(parseString);
        if (len >= max)
            copylen = max - 1;
        else
            copylen = len;
        if (copylen > 0)
            memcpy(buf, parseString, copylen);
        buf[copylen] = '\0';
        parseCh = parseString;
        return copylen;
    }
    else
        return 0;               /* end of string */
}
#endif /* FLEX_SCANNER */

--
Leon.
-------
He knows he'll never have to answer for any of his theories actually
being put to test. If they were, they would be contaminated by reality.
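[For illustration, a minimal sketch of how myinput() could feed the string to flex in successive chunks instead of truncating it. Reusing parseCh as a cursor into parseString is an assumption made for this example; it is not necessarily what the committed fix looks like (per Tom's follow-up, the current sources already handle this).]

#ifdef FLEX_SCANNER
/* Illustration only: hand parseString to flex at most max-1 bytes per
 * call, advancing parseCh each time, so nothing is ever truncated. */
static int
myinput(char *buf, int max)
{
    int copylen;

    if (parseCh == NULL)
        parseCh = parseString;      /* first call: start at the beginning */

    copylen = strlen(parseCh);
    if (copylen >= max)
        copylen = max - 1;
    if (copylen == 0)
        return 0;                   /* end of string */

    memcpy(buf, parseCh, copylen);
    parseCh += copylen;             /* remember how far flex has been fed */
    return copylen;
}
#endif /* FLEX_SCANNER */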
Leon <leon@udmnet.ru> writes:
> Look at this piece of code. It seems that once myinput() has been
> called, the second time around it will return 0 even if the string
> isn't over yet.

It's always a good idea to pull a fresh copy of the sources before opinionating about what works or doesn't work in someone's just-committed changes ;-)

regards, tom lane
> I have finished applying Mike Ansley's changes for long queries, along
> with a bunch of my own. The current status is:
>
> * You can send a query string of indefinite length to the backend.
>   (This is poorly tested for MULTIBYTE, though; would someone who
>   uses MULTIBYTE more than I do try it out?)

I'll take care of this.
---
Tatsuo Ishii
> Thomas Lockhart should speak up - he seems to be the only person who
> still has objections. If the proposed change is to be declined, something
> else has to be applied instead to deal with the lexer's reject feature
> and its accompanying size limits, as well as the grammar inconsistency.

Hmm. I'd suggest that we go with the "greedy lexer" solution, which continues to gobble characters which *could* be an operator until other characters or whitespace are encountered. I don't recall any compelling cases for which this would be an inadequate solution, and we have plenty of time until v6.6 is released to discover problems and work out alternatives.

Sorry for slowing things up; but fwiw I *did* think about it some more ;)

- Thomas

--
Thomas Lockhart    lockhart@alumni.caltech.edu
South Pasadena, California
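[To make the behavior concrete, here is a small self-contained C program illustrating the "greedy lexer" idea: keep consuming characters that could belong to an operator until something else or whitespace appears. The operator character set below is an assumption for the example, not the exact set in scan.l; note how 1+-2 comes out as the three tokens 1, +- and 2, which is the construct Leon mentioned.]

#include <stdio.h>
#include <string.h>

/* Characters that could be part of an operator (illustrative set only). */
static int
is_op_char(char c)
{
    return c != '\0' && strchr("+-*/<>=~!@#%^&|`?$", c) != NULL;
}

int
main(void)
{
    const char *s = "1+-2";

    while (*s)
    {
        int len = 1;

        if (is_op_char(*s))
            while (is_op_char(s[len]))  /* greedy: extend the operator */
                len++;
        printf("token: %.*s\n", len, s);
        s += len;
    }
    return 0;                           /* prints: 1, +-, 2 */
}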
> Thomas Lockhart should speak up...
> He knows he'll never have to answer for any of his theories actually
> being put to test. If they were, they would be contaminated by reality.

You talkin' to me?? ;)

So, while you are on the lexer warpath, I'd be really happy if someone would fix the following behavior (I'm doing this from memory, but afaik it is close to correct):

For non-psql applications, such as tcl or ecpg, which do not do any pre-processing on input tokens, a trailing unterminated string will be lost, and no error will be detected. For example,

  select * from t1 'abc

sent directly to the server will not fail as it should with that garbage at the end. The lexer is in a non-standard mode after all tokens are processed, and the accumulated string "abc" is left in a buffer and not sent to yacc/bison. I think you can see this behavior just by looking at the lexer code.

A simple fix would be to check the size of that accumulated string buffer after lexing, and if it is non-zero then elog(ERROR) a complaint. Perhaps a more general fix would be to ensure that you are never in an exclusive state after all tokens are processed, but I'm not sure how to do that.

- Thomas

--
Thomas Lockhart    lockhart@alumni.caltech.edu
South Pasadena, California
Thomas Lockhart wrote:
>
> > Thomas Lockhart should speak up - he seems to be the only person who
> > still has objections. If the proposed change is to be declined, something
> > else has to be applied instead to deal with the lexer's reject feature
> > and its accompanying size limits, as well as the grammar inconsistency.
>
> Hmm. I'd suggest that we go with the "greedy lexer" solution, which
> continues to gobble characters which *could* be an operator until
> other characters or whitespace are encountered.

'Xcuse my dumbness ;), but is it in any way different from what was proposed (by me and some others)?

--
Leon.
-------
He knows he'll never have to answer for any of his theories actually
being put to test. If they were, they would be contaminated by reality.
Thomas Lockhart wrote:
>
> > Thomas Lockhart should speak up...
> > He knows he'll never have to answer for any of his theories actually
> > being put to test. If they were, they would be contaminated by reality.
>
> You talkin' to me?? ;)

Nein, nein! Sei still bitte! :) This is my signature, which is a week old already :)

> A simple fix would be to check the size of that accumulated string
> buffer after lexing, and if it is non-zero then elog(ERROR) a
> complaint. Perhaps a more general fix would be to ensure that you are
> never in an exclusive state after all tokens are processed, but I'm
> not sure how to do that.

The solution is obvious - to eliminate exclusive states entirely! Banzai!!!

--
Leon.
-------
He knows he'll never have to answer for any of his theories actually
being put to test. If they were, they would be contaminated by reality.
Thomas Lockhart wrote:
>
> > The solution is obvious - to eliminate exclusive states entirely!
> > Banzai!!!
>
> That will complicate the lexer, and make it more brittle and difficult
> to read, since you will have to, essentially, implement the exclusive
> states using flags within each element.
>
> If you want to try it as an exercise, we *might* find it isn't as ugly
> as I am afraid it will be, but...

Gimme the latest lexer source. (I pay for my Internet on a per-minute basis, so I can't connect to CVS.) You will see what I mean.

--
Leon.
-------
He knows he'll never have to answer for any of his theories actually
being put to test. If they were, they would be contaminated by reality.
Leon <leon@udmnet.ru> writes:
>> A simple fix would be to check the size of that accumulated string
>> buffer after lexing, and if it is non-zero then elog(ERROR) a
>> complaint. Perhaps a more general fix would be to ensure that you are
>> never in an exclusive state after all tokens are processed, but I'm
>> not sure how to do that.

> The solution is obvious - to eliminate exclusive states entirely!
> Banzai!!!

Can we do that? Seems like a more likely approach is to ensure that all of the lexer states have rules that ensure they terminate (or raise an error, as for unterminated quoted string) at end of input. I do think checking the token buffer is a hack, and changing the rules a cleaner solution...

regards, tom lane
Tom Lane wrote:
>
> Leon <leon@udmnet.ru> writes:
> >> A simple fix would be to check the size of that accumulated string
> >> buffer after lexing, and if it is non-zero then elog(ERROR) a
> >> complaint. Perhaps a more general fix would be to ensure that you are
> >> never in an exclusive state after all tokens are processed, but I'm
> >> not sure how to do that.
>
> > The solution is obvious - to eliminate exclusive states entirely!
> > Banzai!!!
>
> Can we do that? Seems like a more likely approach is to ensure that
> all of the lexer states have rules that ensure they terminate (or
> raise an error, as for unterminated quoted string) at end of input.
> I do think checking the token buffer is a hack, and changing the rules
> a cleaner solution...

Hmm, yeah, you are right. That is a much simpler solution. We can check in myinput() and input(), when we are about to return end-of-input, that YYSTATE == INITIAL, and raise an error if that's not so. Well, I give up my idea of total extermination of start conditions :)

BTW, while eyeing scan.l again, I noticed that the handling of C-style comments may also have bugs, but I am not completely sure.

--
Leon.
-------
He knows he'll never have to answer for any of his theories actually
being put to test. If they were, they would be contaminated by reality.
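[For illustration, a minimal sketch of the end-of-input check Leon describes, folded into the chunk-feeding myinput() sketched earlier in the thread. YYSTATE is flex's alias for the current start condition (YY_START); the error message text and the reuse of parseCh as a cursor are assumptions for the example, not committed code.]

#ifdef FLEX_SCANNER
/* Illustration only: when the string-input routine is about to report
 * end-of-input, make sure the scanner has returned to the INITIAL
 * start condition; otherwise some token (e.g. a quoted string) was
 * left unterminated at the end of the query. */
static int
myinput(char *buf, int max)
{
    int copylen;

    if (parseCh == NULL)
        parseCh = parseString;

    copylen = strlen(parseCh);
    if (copylen >= max)
        copylen = max - 1;

    if (copylen == 0)
    {
        if (YYSTATE != INITIAL)
            elog(ERROR, "parser: unterminated token at end of query string");
        return 0;                   /* end of string */
    }

    memcpy(buf, parseCh, copylen);
    parseCh += copylen;
    return copylen;
}
#endif /* FLEX_SCANNER */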