Re: [HACKERS] multivariate statistics (v19) - Mailing list pgsql-hackers
From | David Fetter |
---|---|
Subject | Re: [HACKERS] multivariate statistics (v19) |
Date | |
Msg-id | 20170208160908.GC8118@fetter.org Whole thread Raw |
In response to | Re: [HACKERS] multivariate statistics (v19) (Dean Rasheed <dean.a.rasheed@gmail.com>) |
Responses |
Re: [HACKERS] multivariate statistics (v19)
|
List | pgsql-hackers |
On Wed, Feb 08, 2017 at 03:23:25PM +0000, Dean Rasheed wrote: > On 6 February 2017 at 21:26, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > Tomas Vondra wrote: > >> On 02/01/2017 11:52 PM, Alvaro Herrera wrote: > > > >> > Nearby, some auxiliary functions such as n_choose_k and > >> > num_combinations are not documented. What it is that they do? I'd > >> > move these at the end of the file, keeping the important entry points > >> > at the top of the file. > >> > >> I'd say n-choose-k is pretty widely known term from combinatorics. The > >> comment would essentially say just 'this is n-choose-k' which seems rather > >> pointless. So as much as I dislike the self-documenting code, this actually > >> seems like a good case of that. > > > > Actually, we do have such comments all over the place. I knew this as > > "n sobre k", so the english name doesn't immediately ring a bell with me > > until I look it up; I think the function comment could just say > > "n_choose_k -- this function returns the binomial coefficient". > > One of the things you have to watch out for when writing code to > compute binomial coefficients is integer overflow, since the numerator > and denominator get large very quickly. For example, the current code > will overflow for n=13, k=12, which really isn't that large. > > This can be avoided by computing the product in reverse and using a > larger datatype like a 64-bit integer to store a single intermediate > result. The point about multiplying the terms in reverse is that it > guarantees that each intermediate result is an exact integer (a > smaller binomial coefficient), so there is no need to track separate > numerators and denominators, and you avoid huge intermediate > factorials. Here's what that looks like in psuedo-code: > > binomial(int n, int k): > # Save computational effort by using the symmetry of the binomial > # coefficients > k = min(k, n-k); > > # Compute the result using binomial(n, k) = binomial(n-1, k-1) * n / k, > # starting from binomial(n-k, 0) = 1, and computing the sequence > # binomial(n-k+1, 1), binomial(n-k+2, 2), ... > # > # Note that each intermediate result is an exact integer. > int64 result = 1; > for (int i = 1; i <= k; i++) > { > result = (result * (n-k+i)) / i; > if (result > INT_MAX) Raise overflow error > } > return (int) result; > > > Note also that I think num_combinations(n) is just an expensive way of > calculating 2^n - n - 1. Combinations are n!/(k! * (n-k)!), so computing those is more along the lines of: unsigned long long choose(unsigned long long n, unsigned long long k) { if (k > n) { return 0; } unsigned long long r = 1; for(unsigned long long d = 1; d <= k; ++d) { r *= n--; r /= d; } return r; } which greatly reduces the chance of overflow. Best, David. -- David Fetter <david(at)fetter(dot)org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
pgsql-hackers by date: