Re: Removing redundant itemsets - Mailing list pgsql-sql
From | Craig Ringer |
---|---|
Subject | Re: Removing redundant itemsets |
Date | |
Msg-id | 47F0CB12.5010506@postnewspapers.com.au Whole thread Raw |
In response to | Re: Removing redundant itemsets (Craig Ringer <craig@postnewspapers.com.au>) |
Responses |
Re: Removing redundant itemsets
|
List | pgsql-sql |
Craig Ringer wrote: > Allan Kamau wrote: >> Hi all, >> I have a list of purchases (market basket) and I would like to select >> non redundant longest possible patterns by eliminating >> (creating/populating other table to contain only non redandant itemsets) >> purchases having item lists which are fully included in at least one >> other purchase. > > Here's a possibly slow and surely ugly solution (I think it's right, > though I haven't done more than passing testing): > > > > CREATE VIEW togo_as_arr AS > SELECT a.tid, > ARRAY(SELECT item FROM togo b WHERE b.tid = a.tid ORDER BY item) > AS items > FROM togo a GROUP BY tid; > > SELECT arr_a.tid AS redundant_tid, arr_b.tid AS contained_by > FROM togo_as_arr arr_a CROSS JOIN togo_as_arr arr_b > WHERE arr_a.tid <> arr_b.tid AND arr_a.items <@ arr_b.items; Alternately: -- Helps with the massively repeated subquery below CREATE INDEX togo_by_tid_and_item ON togo(tid,item); -- Find any `a' for which `item_from_a_is_in_b' is -- true for all items in `a' SELECT a_tid AS is_redundant, b_tid AS contained_by FROM ( -- For every item in every pair of purchases, -- determine whether the item in purchase `a' -- was also in purchase`b'. SELECT a.tid AS a_tid, b.tid AS b_tid, a.item AS item, EXISTS( -- Was this item from `a' also inthe `b' purchase? SELECT 1 FROM togo x WHERE x.tid = b.tid AND x.item = a.item ) AS item_from_a_is_in_b FROM togoa INNER JOIN togo b ON (a.tid <> b.tid) GROUP BY a.tid, b.tid, a.item) AS item_containment GROUP BY a_tid, b_tid HAVING every(item_from_a_is_in_b); ... which avoids the array building, but is actually considerably slower on the trivial test data. That's not too surprising given that this approach requires a subquery: SELECT 1 FROM togo x WHERE x.tid = b.tid AND x.item = a.item for EVERY item to be tested. Twice, actually, as each record appears in both `a' and `b' positions. I'd be very interested to see what happened on real world test data, especially compared to doing the array accumulation based query off a temporary table instead of a view. I suspect it'll depend on the average number of items per purchase - lots of items per purchase and the array building cost will dominate. That's really just a guess, though. I'm sure there's a properly smart way to do this that I just can't figure out, but this is the best I've come up with so far. -- Craig Ringer