Thread: Avoiding double-counting in aggregates with more than one join?
Hi All, I've noticed in the past that doing aggregates while joining to more than one table can sometimes give you unintended results. For example, suppose I have three tables: products, sales, and resupplies. In sales I track what I sell, and in resupplies I track my own purchases to increase inventory. Both have a foreign key to products. Now I want to run a report showing the total dollars sold for each product versus the total dollars spent for each product. I could try this: SELECT p.id, SUM(s.price * s.qty) AS total_sold, SUM(r.price * r.qty) AS total_spent FROM products p LEFT OUTER JOIN sales s ON s.product_id = p.id LEFT OUTER JOIN resupplies r ON r.product_id = p.id GROUP BY p.id ; That seems pretty safe, but actually I get bad answers, for example if some product has this data: sales ----- sold 1 @ $2/ea resupplies ---------- bought 1 @ $1/eq bought 2 @ $1/ea Then pre-grouping I have this: p.id | s.qty | s.price | r.qty | r.price -----+-------+---------+-------+-------- 1 | 1 | $2 | 1 | $1 1 | 1 | $2 | 2 | $1 You can see the problem is that I'm going to double-count my sales. What I really want is this: p.id | s.qty | s.price | r.qty | r.price -----+-------+---------+-------+-------- 1 | 1 | $2 | 1 | $1 1 | | | 2 | $1 In the past I've always handled these situations by aggregating each table separately and only then joining things together: WITH s AS ( SELECT product_id, SUM(price * qty) AS total_sold FROM sales GROUP BY product_id) s ), r AS ( SELECT product_id, SUM(price * qty) AS total_spent FROM resupplies GROUP BY product_id) r ) SELECT p.id, COALESCE(s.total_sold, 0), COALESCE(r.total_spent, 0) FROM products p LEFT OUTER JOIN s ON s.product_id = p.id LEFT OUTER JOIN r ON r.product_id = p.id ; Since I've guaranteed that each CTE includes at most one row per product, this is safe from double-counting errors. But is there a better way? My approach feels verbose and harder to read. Also you have to type COALESCE a lot. :-) Is there some different way of doing things I haven't thought of yet? Also I wonder about the performance merging all these subqueries together. Would the final merging be any faster if I had an ORDER BY in each CTE? It seems like this pattern comes up a lot; what have others done about it? Thanks, Paul
But is there a better way?
Nothing that would be more readable nor likely more performant.
When performing aggregation it is necessary to limit the scope of the query to only whatever it is you are calculating. Since you wish to compute two things you need two separate parts plus a third to combine them.
If performance is a concern you should move the aggregation queries directly to the main query instead of using the optimization fencing CTE.
SELECT
FROM products
LEFT JOIN (
SELECT sum()
) s USING (product_id)
LEFT JOIN (
SELECT sum()
) r USING (product_id)
If the second "scope" doesn't need to be calculated but simply informs the one-and-only aggregate you should use SEMI JOIN (EXISTS) instead of a INNER/LEFT JOIN. But that isn't what you have here.
David J.