Re: Hash id in pg_stat_statements - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: Hash id in pg_stat_statements |
Date | |
Msg-id | CAEYLb_WkgKhbVjvJDWcOtPYaBRF22sc_dBSfE2k+nA75RQ=azA@mail.gmail.com |
In response to | Re: Hash id in pg_stat_statements (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses | Re: Hash id in pg_stat_statements |
List | pgsql-hackers |
On 2 October 2012 18:16, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> 1. Why isn't something like md5() on the reported query text an equally
> good solution for users who want a query hash?

Because that does not uniquely identify the entry. The very first thing that the docs say on search_path is "Qualified names are tedious to write, and it's often best not to wire a particular schema name into applications anyway". Presumably, the reason it's best not to wire schema names into apps is that it might be useful to modify search_path in a way that dynamically makes the same queries in some application reference what are technically distinct relations. If anyone does this, and it seems likely that many do for various reasons, they will be out of luck when using any kind of pg_stat_statements aggregation that matches on query text (a sketch illustrating this follows below). This was the behaviour that I intended for pg_stat_statements all along, and I think it's better than a solution that matches query strings.

> 2. If people are going to accumulate stats on queries over a long period
> of time, is a 32-bit hash really good enough for the purpose? If I'm
> doing the math right, the chance of collision is already greater than 1%
> at 10000 queries, and rises to about 70% for 100000 queries; see
> http://en.wikipedia.org/wiki/Birthday_paradox
>
> We discussed this issue and decided it was okay for pg_stat_statements's
> internal hash table, but it's not at all clear to me that it's sensible
> to use 32-bit hashes for external accumulation of query stats.

Well, forgive me for pointing this out, but I did propose that the hash be a 64-bit value (which would have necessitated adapting hash_any() to produce 64-bit values), but you rejected the proposal. I arrived at the same probability of a collision as you did and posted it to the list, in discussion shortly after the normalisation stuff was committed.

A more sensible way of assessing the risk would be to estimate the probability of a collision that someone actually ends up caring about, which is considerably less than the 1% for 10,000 entries. I'm not being glib - people are very used to the idea that aggregating information on query costs is a lossy process. Prior to 9.2, the only way execution costs could reasonably be measured at the query granularity on a busy system was to set log_min_duration_statement to something like 1 second.

I am also unconvinced by the idea that aggregating historical data (with the same hash value) in a separate application is likely to make the collision situation appreciably worse. People are going to be using something like an RRD circular buffer to aggregate the information, and I can't see anyone caring about detailed information that is more than a couple of weeks in the past. The point of aggregation isn't to store more queries; it's to construct time-series data from snapshots. Besides, do most applications really even have more than 10,000 distinct queries?
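To make the search_path point concrete, here is a minimal sketch; the schema and table names are invented purely for illustration. The same query text executes against technically distinct relations, so md5() of the text would conflate two entries that the internal hash, computed from the post-parse-analysis tree, keeps separate:

```sql
-- Schema and table names here are hypothetical, for illustration only.
CREATE SCHEMA tenant_a;
CREATE SCHEMA tenant_b;
CREATE TABLE tenant_a.events (id integer);
CREATE TABLE tenant_b.events (id integer);

SET search_path = tenant_a;
SELECT count(*) FROM events;   -- resolves to tenant_a.events

SET search_path = tenant_b;
SELECT count(*) FROM events;   -- same text, resolves to tenant_b.events

-- md5() over the query text cannot distinguish these two executions,
-- but the internal hash can, because the underlying relations differ.
```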
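As for the numbers in point 2, they check out under the usual birthday-paradox approximation for a 32-bit hash space, p(n) ~= 1 - exp(-n(n-1)/(2 * 2^32)), which can be verified directly in SQL:

```sql
-- Birthday-paradox approximation for a 32-bit hash space (2^32 slots):
--   p(n) ~= 1 - exp(-n * (n - 1) / (2 * 2^32))
SELECT n,
       round((1 - exp(-(n * (n - 1.0)) / (2 * 4294967296.0)))::numeric, 4)
           AS collision_probability
FROM (VALUES (10000), (100000)) AS t(n);
-- n = 10000  ->  0.0116  ("greater than 1%")
-- n = 100000 ->  0.6878  ("about 70%")
```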
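Finally, a rough sketch of the snapshot-style aggregation described above. It assumes the hash were exposed as a view column, here called query_hash - a hypothetical name, since exposing it is precisely what this thread is discussing:

```sql
-- "query_hash" is a hypothetical column name; exposing the hash is
-- exactly what is being proposed in this thread.
CREATE TABLE statements_history (
    captured_at timestamptz NOT NULL DEFAULT now(),
    query_hash  bigint      NOT NULL,
    calls       bigint      NOT NULL,
    total_time  double precision NOT NULL
);

-- Run periodically; deltas between successive snapshots of the same
-- query_hash yield the time series.  Rows older than a couple of weeks
-- can be discarded, RRD-style, since old detail is rarely interesting.
INSERT INTO statements_history (query_hash, calls, total_time)
SELECT query_hash, calls, total_time
FROM pg_stat_statements;
```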
--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services