Home > mailing lists

Re: [HACKERS] multivariate statistics (v19) - Mailing list pgsql-hackers

From	David Fetter
Subject	Re: [HACKERS] multivariate statistics (v19)
Date	February 8, 2017 19:09:08
Msg-id	20170208160908.GC8118@fetter.org Whole thread Raw
In response to	Re: [HACKERS] multivariate statistics (v19) (Dean Rasheed <dean.a.rasheed@gmail.com>)
Responses	Re: [HACKERS] multivariate statistics (v19)
List	pgsql-hackers

Tree view

On Wed, Feb 08, 2017 at 03:23:25PM +0000, Dean Rasheed wrote:
> On 6 February 2017 at 21:26, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > Tomas Vondra wrote:
> >> On 02/01/2017 11:52 PM, Alvaro Herrera wrote:
> >
> >> > Nearby, some auxiliary functions such as n_choose_k and
> >> > num_combinations are not documented. What it is that they do? I'd
> >> > move these at the end of the file, keeping the important entry points
> >> > at the top of the file.
> >>
> >> I'd say n-choose-k is pretty widely known term from combinatorics. The
> >> comment would essentially say just 'this is n-choose-k' which seems rather
> >> pointless. So as much as I dislike the self-documenting code, this actually
> >> seems like a good case of that.
> >
> > Actually, we do have such comments all over the place.  I knew this as
> > "n sobre k", so the english name doesn't immediately ring a bell with me
> > until I look it up; I think the function comment could just say
> > "n_choose_k -- this function returns the binomial coefficient".
> 
> One of the things you have to watch out for when writing code to
> compute binomial coefficients is integer overflow, since the numerator
> and denominator get large very quickly. For example, the current code
> will overflow for n=13, k=12, which really isn't that large.
> 
> This can be avoided by computing the product in reverse and using a
> larger datatype like a 64-bit integer to store a single intermediate
> result. The point about multiplying the terms in reverse is that it
> guarantees that each intermediate result is an exact integer (a
> smaller binomial coefficient), so there is no need to track separate
> numerators and denominators, and you avoid huge intermediate
> factorials. Here's what that looks like in psuedo-code:
> 
> binomial(int n, int k):
>     # Save computational effort by using the symmetry of the binomial
>     # coefficients
>     k = min(k, n-k);
> 
>     # Compute the result using binomial(n, k) = binomial(n-1, k-1) * n / k,
>     # starting from binomial(n-k, 0) = 1, and computing the sequence
>     # binomial(n-k+1, 1), binomial(n-k+2, 2), ...
>     #
>     # Note that each intermediate result is an exact integer.
>     int64 result = 1;
>     for (int i = 1; i <= k; i++)
>     {
>         result = (result * (n-k+i)) / i;
>         if (result > INT_MAX) Raise overflow error
>     }
>     return (int) result;
> 
> 
> Note also that I think num_combinations(n) is just an expensive way of
> calculating 2^n - n - 1.

Combinations are n!/(k! * (n-k)!), so computing those is more
along the lines of:

unsigned long long
choose(unsigned long long n, unsigned long long k) {   if (k > n) {       return 0;   }   unsigned long long r = 1;
for(unsigned long long d = 1; d <= k; ++d) {       r *= n--;       r /= d;   }   return r;
 
}

which greatly reduces the chance of overflow.

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

pgsql-hackers by date:

From: Andrew Dunstan
Date: 08 February 2017, 19:07:51
Subject: Re: [HACKERS] drop support for Python 2.3

From: Daniel Gustafsson
Date: 08 February 2017, 19:15:35
Subject: Re: [HACKERS] [PATCH] configure-time knob to set default ssl ciphers

Re: [HACKERS] multivariate statistics (v19) - Mailing list pgsql-hackers

Previous

Next