Re: sequence query performance issues - Mailing list pgsql-performance
From: Richard Huxton
Subject: Re: sequence query performance issues
Date:
Msg-id: 46FCB5F3.4080600@archonet.com
In response to: sequence query performance issues ("Peter Koczan" <pjkoczan@gmail.com>)
Responses: Re: sequence query performance issues
List: pgsql-performance
Peter Koczan wrote:
> Hello,
>
> I have a weird performance issue with a query I'm testing. Basically,
> I'm trying to port a function that generates user uids, and since
> postgres offers a sequence generator function, I figure I'd take
> advantage of that. Basically, I generate our uid range, filter out
> those which are in use, and randomly pick however many I need.
> However, when I run it it takes forever (>10 minutes and I get nothing
> so I cancelled the query) and cpu usage on the server is maxed out.

I'd suspect either an unconstrained join or looping through seq-scans.

> Here's my query (I'll post the explain output later so as not to
> obscure my question):
> => select a.uid from generate_series(1000, 32767) as a(uid) where
> a.uid not in (select uid from people) order by random() limit 1;

I let this run to its conclusion and it's the materialize. As you can
see, it's materializing the result-set once for every value it tests
against (loops=31768).

                                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=62722.66..62722.67 rows=1 width=4) (actual time=189963.485..189963.485 rows=0 loops=1)
   ->  Sort  (cost=62722.66..62723.91 rows=500 width=4) (actual time=189961.063..189961.063 rows=0 loops=1)
         Sort Key: random()
         ->  Function Scan on generate_series a  (cost=184.00..62700.25 rows=500 width=4) (actual time=189960.797..189960.797 rows=0 loops=1)
               Filter: (NOT (subplan))
               SubPlan
                 ->  Materialize  (cost=184.00..284.00 rows=10000 width=2) (actual time=0.000..2.406 rows=9372 loops=31768)
                       ->  Seq Scan on people  (cost=0.00..174.00 rows=10000 width=2) (actual time=0.055..7.181 rows=10000 loops=1)
 Total runtime: 189967.150 ms

Hmm - why is it doing that? It's clearly confused about something. I
suspect the root of the problem is that it doesn't know what
generate_series() will return. To the planner it's just another
set-returning function. This means it's getting (i) the # of rows wrong
(rows=500), and it also doesn't know (ii) that there will be no nulls or
(iii) what the range of values returned will be.

Easy enough to test:

CREATE TEMP TABLE all_uids (uid int2);
INSERT INTO all_uids SELECT generate_series(1000,32767);
ANALYSE all_uids;

EXPLAIN ANALYSE
SELECT a.uid FROM all_uids a
WHERE a.uid NOT IN (SELECT uid FROM people)
ORDER BY random() LIMIT 1;

                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1884.14..1884.14 rows=1 width=2) (actual time=39.019..39.019 rows=0 loops=1)
   ->  Sort  (cost=1884.14..1923.85 rows=15884 width=2) (actual time=39.014..39.014 rows=0 loops=1)
         Sort Key: random()
         ->  Seq Scan on all_uids a  (cost=199.00..775.81 rows=15884 width=2) (actual time=38.959..38.959 rows=0 loops=1)
               Filter: (NOT (hashed subplan))
               SubPlan
                 ->  Seq Scan on people  (cost=0.00..174.00 rows=10000 width=2) (actual time=0.046..7.282 rows=10000 loops=1)
 Total runtime: 39.284 ms

That's more sensible. I'd actually use a table to track unused_uids and
have triggers that kept everything in step. However, if you didn't want
to do that, I'd try a left-join:

EXPLAIN ANALYSE
SELECT a.uid
FROM generate_series(1000, 32767) AS a(uid)
LEFT JOIN people p ON a.uid = p.uid
WHERE p.uid IS NULL
ORDER BY random() LIMIT 1;

Not ideal, but like I say, I'd use an unused_uids table. If nothing
else, I'd be wary about immediately re-using a uid - your db +
application might cope fine, but these values have a tendency to be
referred to elsewhere.
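Something like this is what I have in mind for the unused_uids approach
(completely untested, and it assumes people.uid is an int2 and that uids
only ever come and go through plain INSERT/DELETE on people - if you
UPDATE uids you'd need to handle that too):

CREATE TABLE unused_uids (uid int2 PRIMARY KEY);

-- Seed it with everything in the range that isn't already taken.
INSERT INTO unused_uids
SELECT a.uid
FROM generate_series(1000, 32767) AS a(uid)
LEFT JOIN people p ON a.uid = p.uid
WHERE p.uid IS NULL;

-- Keep it in step with people.
CREATE OR REPLACE FUNCTION track_uids() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        DELETE FROM unused_uids WHERE uid = NEW.uid;
    ELSIF TG_OP = 'DELETE' THEN
        INSERT INTO unused_uids VALUES (OLD.uid);
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER people_uid_tracker
AFTER INSERT OR DELETE ON people
FOR EACH ROW EXECUTE PROCEDURE track_uids();

Picking a free uid is then just:

SELECT uid FROM unused_uids ORDER BY random() LIMIT 1;

which keeps the "which uids are free?" question out of the query
entirely, and makes it easy to see at any time what's available.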
HTH

--
  Richard Huxton
  Archonet Ltd