Thread: Functions as conditions
Hello all,
I'd like some verification on how functions as conditions work and perform on a table, specifically a function that takes a specific column as input and outputs a boolean value.

psql --version
psql (PostgreSQL) 9.5.0
Suppose I have a table with two columns, s and an md5 hash, with 700,000 rows:

CREATE TABLE t_random AS SELECT s, md5(random()::text) FROM generate_series(1,700000) s;
Let's create a function that checks if column s is over 500:
CREATE OR REPLACE FUNCTION test_func(integer)
RETURNS boolean
LANGUAGE plpgsql
AS $function$
BEGIN
    IF $1 > 500 THEN
        RETURN true;
    ELSE
        RETURN false;
    END IF;
END
$function$;
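As a quick sanity check that the function behaves as intended (values here are illustrative):

SELECT test_func(499) AS below, test_func(501) AS above;
-- below = false, above = true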
Running:
EXPLAIN ANALYZE SELECT * FROM t_random WHERE test_func(s);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Seq Scan on t_random (cost=0.00..187834.00 rows=233333 width=37) (actual time=0.505..572.463 rows=699500 loops=1)
Filter: test_func(s)
Rows Removed by Filter: 500
Planning time: 0.025 ms
Execution time: 600.917 ms
As opposed to:
EXPLAIN ANALYZE SELECT * FROM t_random WHERE s > 500;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
Seq Scan on t_random (cost=0.00..14584.00 rows=699546 width=37) (actual time=0.080..75.001 rows=699500 loops=1)
Filter: (s > 500)
Rows Removed by Filter: 500
Planning time: 0.029 ms
Execution time: 98.257 ms
The first query plan has a horrendous query time and estimates when using test_func(). I'm assuming this is because the planner has no statistics for the function itself. It can't base its input on any index, meaning it has no sense of how big the table is, correct? I'm not exactly sure of the reason here.
Let's assume that this function is absolutely necessary and that its result changes based on some environment variables, while maintaining only a single table. We cannot have a static column with fixed boolean values, since this column will always be changing depending on the environment. Would a good solution then be to use views and create a column based on the function?
CREATE VIEW t_view AS SELECT *, test_func(s) AS passed FROM t_random;
Running:
EXPLAIN ANALYZE SELECT * FROM t_view WHERE passed = true;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Subquery Scan on t_view (cost=0.00..194834.00 rows=350000 width=38) (actual time=2.198..661.008 rows=699500 loops=1)
Filter: t_view.test_func
Rows Removed by Filter: 500
-> Seq Scan on t_random (cost=0.00..187834.00 rows=700000 width=37) (actual time=0.026..579.125 rows=700000 loops=1)
Planning time: 0.040 ms
Execution time: 690.057 ms
The query plan does a bit better, but is this the best plan I can hope to achieve with this specific function? Materialized views are not an option, since the environment is changing constantly and I would like an always-up-to-date database.
Thank you,
-Bao
On Thu, Apr 21, 2016 at 1:14 PM, David G. Johnston <david.g.johnston@gmail.com> wrote:

> CREATE OR REPLACE FUNCTION test_func(integer)
> RETURNS boolean
> LANGUAGE plpgsql
You should always use "sql" if your function can be written to use it acceptably. In this case it can.
[...]
STRICT -- minimal help, but it's nice to specify your intended behavior when faced with null
IMMUTABLE -- your example has no external dependencies
LANGUAGE sql
AS $function$
SELECT $1 > 500;
$function$;
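Assembled into a complete statement (a sketch; the header elided above is reconstructed from the original definition):

CREATE OR REPLACE FUNCTION test_func(integer)
RETURNS boolean
STRICT
IMMUTABLE
LANGUAGE sql
AS $function$
SELECT $1 > 500;
$function$;

Because this is a single-expression SQL function, the planner can typically inline it into the calling query, avoiding the per-row function-call overhead entirely.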
> The first query plan has a horrendous query time and estimates when using test_func(). [...] I'm not exactly sure of the reason here.
Both are sequential scans... the most likely reason is the overhead of invoking a volatile plpgsql function 700,000 times.
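Incidentally, the rows=233333 estimate above is exactly one third of the 700,000-row table, which is evidently the planner's default guess for a boolean function it has no statistics for. The function's volatility can be checked in pg_proc ('v' = volatile, 's' = stable, 'i' = immutable); plpgsql functions are VOLATILE unless declared otherwise:

SELECT proname, provolatile FROM pg_proc WHERE proname = 'test_func';
-- provolatile is 'v' for the plpgsql definition above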
> Let's assume that this function is absolutely necessary and that its result changes based on some environment variables [...] Would a good solution then be to use views and create a column based on the function?
>
> CREATE VIEW t_view AS SELECT *, test_func(s) AS passed FROM t_random;
The addition of the view is immaterial from a performance standpoint - only usability is affected.
> The query plan does a bit better, but is this the best plan I can hope to achieve with this specific function? Materialized views are not an option, since the environment is changing constantly and I would like an always-up-to-date database.
I'm lost, but maybe someone else will want to try and figure out what you mean. Otherwise, consider providing a concrete example.
If the output of the function is truly non-immutable you are going to have significant difficulty getting any kind of optimization during execution - the best you can hope for is to write the function in question as efficiently as possible and only call it when all other possible conditions already pass. IOW, you will always be faced with a sequential scan when it comes to evaluating the function's result.
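One way to arrange "only call it when all other conditions already pass" is to raise the function's declared cost so the planner orders it after cheaper quals (a sketch; the COST value is illustrative):

-- Raise the declared per-call cost; the planner sorts quals by cost,
-- so the expensive function is evaluated last.
ALTER FUNCTION test_func(integer) COST 1000;

-- Cheaper conditions now filter rows before test_func(s) runs.
SELECT * FROM t_random WHERE s > 500 AND test_func(s);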
With an immutable function you could create a functional index...
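For example, if an IMMUTABLE variant were acceptable, something like this becomes possible (a sketch; index names are illustrative):

-- Expression index; requires test_func to be declared IMMUTABLE.
CREATE INDEX t_random_test_func_idx ON t_random (test_func(s));

-- Alternatively, a partial index covering only the rows that pass.
CREATE INDEX t_random_passed_idx ON t_random (s) WHERE test_func(s);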
David J.
Unfortunately, the function is non-immutable and must stay that way. To make matters worse, it's the condition that must pass before any other condition. This explains the poor performance that I'm getting.
This has cleared up much of my misunderstanding, thank you very much.

-Bao
--------------------------------------------------------------------------------------------------------------------------
Subquery Scan on t_view (cost=0.00..194834.00 rows=350000 width=38) (actual time=2.198..661.008 rows=699500 loops=1)
Filter: t_view.test_func
Rows Removed by Filter: 500
-> Seq Scan on t_random (cost=0.00..187834.00 rows=700000 width=37) (actual time=0.026..579.125 rows=700000 loops=1)
Planning time: 0.040 ms
Execution time: 690.057 mssThe query plan does a bit better but is this the best plan I can hope to achieve with the specific function? Materialized views are not an option since the environment is changing constantly and I would like an always-up-to-date database.I'm lost but maybe someone else will want to try and figure out you mean. Otherwise consider a concrete example.If the output of the function is truly non-immutable you are going to have significant difficulty getting any kind of optimization during execution - the best you can hope for is to write the function in question as efficiently as possible and only call it when all other possible conditions already pass. IOW, you will always be faced with a sequential scan when it comes to evaluating the function's result.With an immutable function you could create a functional index...David J.