Re: Hash id in pg_stat_statements - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: Hash id in pg_stat_statements |
Date | |
Msg-id | CAEYLb_UYZtizMCD0g+2txhC7vdeGA9-9XjCs-f=gPvgx0_+5aA@mail.gmail.com Whole thread Raw |
In response to | Re: Hash id in pg_stat_statements (Stephen Frost <sfrost@snowman.net>) |
Responses |
Re: Hash id in pg_stat_statements
Re: Hash id in pg_stat_statements |
List | pgsql-hackers |
On 1 October 2012 18:05, Stephen Frost <sfrost@snowman.net> wrote: > * Peter Geoghegan (peter@2ndquadrant.com) wrote: >> That won't really help matters. There'd still be duplicate entries, >> from before and after the change, even if we make it immediately >> obvious which is which. The only reasonable solution in that scenario >> is to bump PGSS_FILE_HEADER, which will cause all existing entries to >> be invalidated. > > You're going to have to help me here, 'cause I don't see how there can > be duplicates if we include the PGSS_FILE_HEADER as part of the hash, > unless we're planning to keep PGSS_FILE_HEADER constant while we change > what the hash value is for a given query, yet that goes against the > assumptions that were laid out, aiui. Well, they wouldn't be duplicates if you think that the fact that one query was executed before some point release and another after ought to differentiate queries. I do not. > If there's a change that results in a given query X no longer hashing to > a value A, we need to change PGSS_FILE_HEADER to invalidate statistics > which were collected for value A (or else we risk an independent query Y > hashing to value A and ending up with completely invalid stats..). > Provided we apply that consistently and don't reuse PGSS_FILE_HEADER > values along the way, a combination of PGSS_FILE_HEADER and the hash > value for a given query should be unique over time. > > We do need to document that the hash value for a given query may > change.. By invalidate, I mean that when we go to open the saved file, if the header doesn't match, the file is considered corrupt, and we simply log that the file could not be read, before unlinking it. This would be necessary in the unlikely event of there being some substantive change in the representation of query trees in a point release. I am not aware of any precedent for this, though Tom said that there was one. Tom: Could you please describe the precedent you had in mind? I would like to better understand this risk. I don't want to get too hung up on what we'd do if this problem actually occurred, because that isn't what this thread is about. Exposing the hash is a completely unrelated question, except that to do so might make pg_stat_statements (including its limitations) better understood. In my opinion, the various log parsing tools have taught people to think about this in the wrong way - the query string is just a representation of a query (and an imperfect one at that). There are other, similar tools that exist in proprietary databases. They expose a hash value, which is subject to exactly the same caveats as our own. They explicitly encourage the type of aggregation by third-party tools that I anticipate will happen with pg_stat_statements. I want to facilitate this, but I'm confident that the use of (dbid, userid, querytext) as a "candidate key" by these tools is going to cause confusion, in subtle ways. By using the hash value, those tools are exactly as well off as pg_stat_statements is, which is to say, as well off as possible. I simply do not understand objections to the proposal. Have I missed something? -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
pgsql-hackers by date: