Re: short-cutting if sum()>constant - Mailing list pgsql-sql
From: Ivan Sergio Borgonovo
Subject: Re: short-cutting if sum()>constant
Msg-id: 20091223115628.1709caf5@dawn.webthatworks.it
In response to: Re: short-cutting if sum()>constant (Pavel Stehule <pavel.stehule@gmail.com>)
Responses: Re: short-cutting if sum()>constant
List: pgsql-sql
On Wed, 23 Dec 2009 08:38:52 +0100
Pavel Stehule <pavel.stehule@gmail.com> wrote:

> > As even more unexpected... when all rows are >0 and most of them
> > are equal to 1 the generate_series version performs appreciably
> > better (roughly 15% faster). And I think your version can be
> > further optimised:
> >
> > select count(*) from (select (generate_series(1,a))::int from
> > data limit 90000000) s;
> >
> > This performs 30% faster.
> >
> > So what's so slow in the plpgsql version?
>
> don't forget - plpgsql is interpreted - it is best as glue for SQL

Yeah, but how many times is it interpreted in a 1M-cycle run,
especially when you run it more than once and the interpreter should
be "hot"? I don't know how much "interpreter" work can be saved from
a previous run of the function... but still there should be something
left [1].

plpgsql isn't duck typed, and that function doesn't leave much room
for "interpretation". Unless I'm missing something, in a typed
language like plpgsql that function could easily be turned into its
C equivalent by the interpreter. I really thought the cost of running
plpgsql was much lower in this kind of situation. Note that the cost
doesn't come from the one-time cost of interpreting the function: it
is proportional to the number of cycles.

Even hard-coding the LIMIT in the plpgsql version doesn't make it
fast enough, so it is not the planner.

In a 10M-row dataset where 8332885 rows are =1 and the rest are =2:

If I'm testing for >900000 I get 940.580 ms vs 1302.162 ms.
If I'm testing for >9000000 I get 8816.263 ms vs 12306.483 ms.

BTW:
select sum(a) from data;   takes 1999.492 ms
select count(*) from data; takes 1612.039 ms

While smart, your way is definitely more convoluted and should
require many more CPU cycles and more memory.

The plperl version:

create or replace function totano(int) returns int as $$
  my $b = 0;
  my $row;
  my $sth = spi_query("select a from data");
  while (defined($row = spi_fetchrow($sth))) {
    $b += $row->{a};
    if ($b >= $_[0]) {
      return $b;
    }
  }
  return $b;
$$ LANGUAGE plperl;

is 10 times slower than the plpgsql version.

> statement. I don't think plpgsql is slow - speed is similar to
> using built-in functionality. But I am sure rewriting your
> function in C could help, if you need maximal speed.

In my case it is not worth it. I was just looking for something that
didn't have to scan all the rows. With the real distribution of data
and no real workload on the box your solution is faster, but the
difference, while unexpectedly appreciable, is nearly negligible.

Anyway, I'm not able to explain the difference in speed between
plpgsql and your solution in this case, unless plpgsql is switching
back and forth between binary data and its text representation. It
would be nice if someone who knows the plpgsql internals could
explain where the cost comes from.

> I think there is another trick, though I am not sure if it is
> faster. You can create your own aggregate. In the state function
> you can calculate and check the state value. If it is over your
> limit you can raise an exception. So if your query finishes with
> the custom exception, then sum(c) > n is true.

I may test it later.

thanks

--
Ivan Sergio Borgonovo
http://www.webthatworks.it
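
For reference, the plpgsql version whose timings are compared above
is not quoted anywhere in this message. A minimal sketch of what such
a short-circuiting loop presumably looks like, reusing the data table
and column a from the examples; the function name and exact body here
are guesses, not the original code:

create or replace function sum_over_plpgsql(lim int)
returns boolean as $$
declare
  s int := 0;  -- running sum
  r record;
begin
  for r in select a from data loop
    s := s + r.a;
    if s > lim then
      return true;  -- short-circuit: stop scanning as soon as
                    -- the running sum passes the threshold
    end if;
  end loop;
  return false;     -- scanned everything: sum stayed <= lim
end;
$$ language plpgsql;

Something like select sum_over_plpgsql(9000000); would then
correspond to the >9000000 test above.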
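
Pavel's aggregate idea from the last quoted paragraph can be sketched
as follows: the state function raises an exception as soon as the
running sum passes the limit, and a wrapper traps it. Every name here
(sum_limit_acc, sum_limit, sum_over_agg) is hypothetical, and the
limit is passed as a second aggregate argument:

create or replace function sum_limit_acc(state int, val int, lim int)
returns int as $$
begin
  if state + val > lim then
    raise exception 'sum limit reached';  -- abort the scan early
  end if;
  return state + val;
end;
$$ language plpgsql;

create aggregate sum_limit(int, int) (
  sfunc = sum_limit_acc,  -- called once per row with the running sum
  stype = int,
  initcond = '0'
);

-- Turn the exception into a boolean answer.
create or replace function sum_over_agg(lim int) returns boolean as $$
begin
  perform sum_limit(a, lim) from data;
  return false;            -- aggregate finished: sum stayed <= lim
exception
  when raise_exception then
    return true;           -- state function bailed out: sum > lim
end;
$$ language plpgsql;

Whether the per-row plpgsql state-function call beats a plain scan is
exactly the kind of overhead questioned above, so this may well turn
out slower, not faster.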