Re: BUG #19340: Wrong result from CORR() function - Mailing list pgsql-bugs
| From | Tom Lane |
|---|---|
| Subject | Re: BUG #19340: Wrong result from CORR() function |
| Date | |
| Msg-id | 648885.1764779616@sss.pgh.pa.us Whole thread Raw |
| In response to | Re: BUG #19340: Wrong result from CORR() function (Dean Rasheed <dean.a.rasheed@gmail.com>) |
| List | pgsql-bugs |
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> On Wed, 3 Dec 2025 at 01:27, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Poking further at this, I found that my v2 patch fails that principle
>> in one case:
>>
>> regression=# SELECT corr( 0.1 , 'nan' ) FROM generate_series(1,1000) g;
>> corr
>> ------
>>
>> (1 row)
>>
>> We see that Y is constant and therefore return NULL, despite the
>> other NaN input.
> I think that would be more readable as 2 separate "if" statements,
Actually, in the light of morning I think this behavior is probably
fine. This example treads on two different edge-cases, one where
we're supposed to return NULL and one where we're supposed to return
NaN, and it's not that open-and-shut which rule should win. A
counterexample here is float8_covar_samp: that returns NULL if there
are less than two input values, and will still do so if there is one
input that is NaN. Should we change that? I don't think so.
The fact that HEAD returns NULL for this corr() case as long as
there is not enough input to create a roundoff problem also suggests
to me that that's the behavior to stick with.
So I'm now thinking that the NULL-output rule should win in cases
where they are both applicable. However, I don't know if we want
to take that so far as to say that constant-NaN input should
produce a NULL; the argument that NaN isn't really a constant
still has some force in my mind. What do you think?
> Another case to consider is this:
> SELECT corr(1.3 + x*1e-16, 1.3 + x*1e-16) FROM generate_series(1, 3) x;
> Here Sxx and Syy end up being zero, so the current code returns NULL,
> but with the patch it returns NaN (because Sxy is also zero). It's not
> quite obvious what to do in this case (the correct answer is 1, but
> there' no way of reliably computing that with double precision
> arithmetic). My first thought is that we should probably try to limit
> NaN results to NaN inputs, and so it would be better to continue to
> return NULL for cases like this where either Sxx or Syy are zero, even
> though the inputs weren't quite constant.
Good example. I had wondered whether to retain the final test for zero
Sxx/Syy, and this shows that we do need that (along with a comment
explaining that roundoff error could produce zeroes even though we
know the inputs weren't constant).
> Then there's this:
> SELECT corr(1e-100 + x*1e-105, 1e-100 + x*1e-105) FROM generate_series(1, 3) x;
> Here Sxx, Syy, and Sxy are all non-zero (roughly 2e-210), but the
> product Sxx * Syy underflows to zero. So in both HEAD and with the
> patch, this returns Infinity. That's not good, given that the
> correlation coefficient is supposed to lie in the range [-1,1].
> The correct answer in this case is also 1, which could be achieved by
> taking the square roots of Sxx and Syy separately, before multiplying,
> but it might also be sensible to clamp the result to the range [-1,1].
I like the idea of taking the square roots separately. I believe
sqrt() is a hardware operation on pretty much any machine people still
care about, and we're doing this part only once per aggregation.
So the cost seems fairly minimal.
regards, tom lane
pgsql-bugs by date: