finding medians - Mailing list pgsql-hackers
From | Josh Burdick |
---|---|
Subject | finding medians |
Date | |
Msg-id | 3CF688AB.4010005@gradient.cis.upenn.edu Whole thread Raw |
Responses |
Re: finding medians
Re: finding medians Re: finding medians |
List | pgsql-hackers |
At the end of this message is some code I used to find medians. It's kind of a hack, but approximately works, and isintended as a somewhat awkward stopgap for people who need to use medians. It illustrates the limitations of the current aggregate function setup, which works so nicely for avg() and stddev(). I don't have any good solutions. I tried using a float4[] to store each element as it's added, but I couldn't get array updates working in PL/PgSQL, so that didn't help. Perhaps aggregate functions could be passed an array? Or a cursor, pointing at the first line? I'm not sure. Anyways, perhaps it'll be helpful. Josh -- Josh Burdick jburdick@gradient.cis.upenn.edu http://www.cis.upenn.edu/~jburdick /* Implementing median-finding in "pure Postgres." Does this by copying data to a temporary table. A weakness of this code is that it uses sorting, instead of Hoare's linear-time median algorithm. Presumably sorting is implemented so efficiently that it'll be faster than anything written in PL/PgSQL. (Although Hoare's algorithm implemented in C would be faster than either.) BUG: this isn't properly set up to deal with multiple users. For example, if A computes a median, then B could read the data from the median_tmp table. Possibly you could fiddle with transaction isolation levels, or add a user field to median_tmp, or something else complicated, to prevent this, but for now I'm not worrying about this. Written by Josh Burdick (jburdick@gradient.cis.upenn.edu). Anyone can use this under the same license as Postgres. 20020524, jtb: started. */ drop aggregate median(float4); drop table median_tmp; drop sequence median_id; drop index median_tmp_median_id; drop function median_sfunc_float4(bigint, float4); drop function median_finalfunc_float4(bigint); create sequence median_id; create table median_tmp ( median_id int, x float4 ); create index median_tmp_median_id on median_tmp(median_id); create function median_sfunc_float4 (bigint, float4) returns bigint as ' insert into median_tmp values (case when $1 = 0 then nextval(''median_id'') else $1 end, $2); select currval(''median_id''); ' language 'SQL'; create function median_finalfunc_float4 (bigint) returns float4 as ' declare i bigint; n bigint; c refcursor; m float4; m1 float4; begin n := (select count(*) from median_tmp where median_id = $1); open c for select x from median_tmp where median_id = $1 order by x; for i in 1..((n+1)/2) loop fetch c into m; end loop; /* if n is even, fetch the next value, and average the two */ if (n % int8(2) = int8(0)) then fetch c into m1; m := (m + m1) / 2; end if; delete from median_tmp where median_id = $1; return m; end ' language 'plpgsql'; create aggregate median ( basetype = float4, stype = bigint, initcond = 0, sfunc = median_sfunc_float4, finalfunc = median_finalfunc_float4 );
pgsql-hackers by date: