BUG #15307: Low numerical precision of (Co-) Variance - Mailing list pgsql-bugs
From | PG Bug reporting form
---|---
Subject | BUG #15307: Low numerical precision of (Co-) Variance
Date |
Msg-id | 153313051300.1397.9594490737341194671@wrigleys.postgresql.org
Responses | Re: BUG #15307: Low numerical precision of (Co-) Variance
List | pgsql-bugs
The following bug has been logged on the website:

Bug reference:      15307
Logged by:          Erich Schubert
Email address:      erich@debian.org
PostgreSQL version: 9.6.6
Operating system:   Linux

Description:

Numerical precision of variance computations in PostgreSQL is too low.

SELECT VAR_SAMP(x::float8), COVAR_SAMP(x, x), VAR_SAMP(x)
FROM (SELECT 1000000.01 as x UNION SELECT 999999.99 as x) AS x

The first two give the low-precision answer 0.000244140625 instead of 0.0002. Interestingly enough, VAR_SAMP(x) is okay - I guess that PostgreSQL autodetects a fixed-precision decimal here, or uses some higher-precision code.

If you add just one more digit (10 million +- 1 cent), the bad functions even return a variance of 0:

SELECT VAR_SAMP(x::float8), COVAR_SAMP(x, x), VAR_SAMP(x)
FROM (SELECT 10000000.01 as x UNION SELECT 9999999.99 as x) AS x

The reason is catastrophic cancellation. Apparently, the covariance is computed using the E[XY] - E[X]E[Y] approach, which suffers from low numerical precision (a minimal standalone C reproduction is appended at the end of this report).

While I report this against version 9.6.6 (since I used sqlfiddle), the problem is clearly still present in Git:

https://github.com/postgres/postgres/blob/6bf0bc842bd75877e31727eb559c6a69e237f831/src/backend/utils/adt/float.c#L2606
https://github.com/postgres/postgres/blob/6bf0bc842bd75877e31727eb559c6a69e237f831/src/backend/utils/adt/float.c#L2969

It should also affect the general "numeric" version (i.e. all variants of variance/covariance/standard deviation use the known-bad equation), but for integers it will usually be fine as long as the sum of squares can be stored exactly:

https://github.com/postgres/postgres/blob/6bf0bc842bd75877e31727eb559c6a69e237f831/src/backend/utils/adt/numeric.c#L4883

There are a number of methods to get reasonable precision. The simplest to implement is a two-pass approach: first compute the mean of each variable, then compute E[(X - meanX) * (Y - meanY)] in a second pass. This usually gives the best precision. Faster (single-pass) methods can be found in the literature from the 1970s, and also in Donald Knuth's "The Art of Computer Programming". In particular, Youngs and Cramer's version performed quite well (and, surprisingly, much better than Welford's, supposedly due to CPU pipelining); a sketch of its update step is appended below.

Single-pass and parallelization-friendly approaches (interesting in particular for distributed databases, but also to avoid I/O cost) with good accuracy are studied in:

> Erich Schubert and Michael Gertz. Numerically Stable Parallel Computation of (Co-)Variance. In: Proceedings of the 30th International Conference on Scientific and Statistical Database Management (SSDBM), Bolzano-Bozen, Italy, 2018, 10:1-10:12. DOI: 10.1145/3221269.3223036. https://dl.acm.org/citation.cfm?doid=3221269.3223036

The problem has also been observed in other SQL databases (e.g. MS SQL Server; others like MySQL have implemented a numerically stable single-pass approach), see:

> Niranjan Kamat and Arnab Nandi. A Closer Look at Variance Implementations in Modern Database Systems. ACM SIGMOD Record 45.4 (2017): 28-33. DOI: 10.1145/3092931.3092936. https://dl.acm.org/citation.cfm?id=3092936

Sorry, I do not know the PostgreSQL internals (aggregation...) well enough to provide a patch, but the sketches appended below illustrate the algorithms in plain C.
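For illustration only (this is not PostgreSQL code, just a minimal standalone C sketch of the failure mode): accumulating sum(X) and sum(X*X) and then subtracting - roughly what the linked float8 code does - cancels away essentially all significant digits on the 10-million example, while a two-pass computation does not.

/* Minimal standalone sketch: naive sum-of-squares formula vs. a
   two-pass computation, on the 10-million example from above. */
#include <stdio.h>

int main(void)
{
    double x[] = {10000000.01, 9999999.99};
    int n = 2;
    double sum = 0.0, sumsq = 0.0;

    /* Single pass accumulating sum(x) and sum(x^2). */
    for (int i = 0; i < n; i++)
    {
        sum += x[i];
        sumsq += x[i] * x[i];
    }

    /* Naive: Sxx = sum(x^2) - sum(x)^2 / n; catastrophic cancellation. */
    double var_naive = (sumsq - sum * sum / n) / (n - 1);

    /* Two passes: mean first, then squared deviations from the mean. */
    double mean = sum / n;
    double ss = 0.0;
    for (int i = 0; i < n; i++)
        ss += (x[i] - mean) * (x[i] - mean);
    double var_twopass = ss / (n - 1);

    printf("naive:    %.17g\n", var_naive);   /* nowhere near the true 0.0002 */
    printf("two-pass: %.17g\n", var_twopass); /* ~0.0002 */
    return 0;
}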
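A sketch of the Youngs and Cramer single-pass update mentioned above, in its usual formulation with a state of (count, running sum, running sum of squared deviations); the type and function names (yc_state, yc_accum, yc_var_samp) are mine, not PostgreSQL's:

#include <stdio.h>

/* Youngs-Cramer running state; names are illustrative. */
typedef struct
{
    long   n;    /* number of input values            */
    double Sx;   /* running sum of x                  */
    double Sxx;  /* running sum of squared deviations */
} yc_state;

static void
yc_accum(yc_state *st, double x)
{
    st->n++;
    st->Sx += x;
    if (st->n > 1)
    {
        /* Youngs-Cramer update: S += (n*x - T)^2 / (n*(n-1)) */
        double tmp = x * st->n - st->Sx;
        st->Sxx += tmp * tmp / ((double) st->n * (st->n - 1));
    }
}

static double
yc_var_samp(const yc_state *st)
{
    return (st->n > 1) ? st->Sxx / (st->n - 1) : 0.0; /* SQL would return NULL */
}

int main(void)
{
    yc_state st = {0, 0.0, 0.0};

    yc_accum(&st, 10000000.01);
    yc_accum(&st, 9999999.99);
    printf("var_samp = %.17g\n", yc_var_samp(&st)); /* ~0.0002 */
    return 0;
}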
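Finally, a sketch of a parallelization-friendly variant: each partial aggregate keeps (count, mean, sum of squared deviations from its own mean), single values are added with Welford's update, and two partial states are merged with the well-known combine rule of Chan et al. (variants of such combine steps are what the SSDBM paper above analyzes). Again, all names here are illustrative:

#include <stdio.h>

/* Partial-aggregate state: count, mean, and M2 = sum of squared
   deviations from this partial's own mean. */
typedef struct
{
    double n;
    double mean;
    double M2;
} var_state;

/* Welford's per-value update. */
static void
var_accum(var_state *st, double x)
{
    double delta = x - st->mean;

    st->n += 1.0;
    st->mean += delta / st->n;
    st->M2 += delta * (x - st->mean);
}

/* Merge two partial states (assumes at least one is non-empty). */
static var_state
var_combine(var_state a, var_state b)
{
    var_state r;
    double delta = b.mean - a.mean;

    r.n = a.n + b.n;
    r.mean = a.mean + delta * b.n / r.n;
    r.M2 = a.M2 + b.M2 + delta * delta * a.n * b.n / r.n;
    return r;
}

int main(void)
{
    /* Pretend two workers each saw one of the rows, then merge. */
    var_state a = {0, 0, 0}, b = {0, 0, 0};

    var_accum(&a, 10000000.01);
    var_accum(&b, 9999999.99);

    var_state r = var_combine(a, b);
    printf("var_samp = %.17g\n", r.M2 / (r.n - 1)); /* ~0.0002 */
    return 0;
}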