Comparing user attributes with bitwise operators - Mailing list pgsql-performance
From | Patrick Clery |
---|---|
Subject | Comparing user attributes with bitwise operators |
Date | |
Msg-id | 200409160141.37604.patrick@phpforhire.com Whole thread Raw |
Responses |
Re: Comparing user attributes with bitwise operators
Re: Comparing user attributes with bitwise operators Re: Comparing user attributes with bitwise operators |
List | pgsql-performance |
I'm working on a dating/personals/match-making site, that has used many different methods of "match-making", that all seem to be very slow. One I am attempting now that seems to be an efficient method of storage, but not the best for indexing, is using bitwise operators to compare one person's profile to another's. This method seems to be very fast on a small scale, but I am dealing with a large user-base here, in excess of 200,000 users that will be executing this search function every time they login (the search results of their profile will appear on the main page after they have logged in). I've opted to use "label tables" for each possible set of answers. (i.e: Marital Status) For this table, the set of bits -- bit(5) -- are represented as such: +-----+------------+ | Bit | Meaning | +-----+------------+ | 1 | single | | 2 | separated | | 3 | married | | 4 | divorced | | 5 | widowed | +-----+------------+ Here's the structure of the marital status table: # \d marital_matrix Table "public.marital_matrix" Column | Type | Modifiers -----------+----------------+----------------------------------------------------------------------- member_id | integer | not null default nextval('public.marital_matrix_member_id_seq'::text) status | bit varying(5) | not null default (0)::bit(5) p_status | bit varying(5) | not null default (0)::bit(5) Indexes: "marital_matrix_pkey" PRIMARY KEY, btree (member_id) "idx_marital_matrix" btree ((status::"bit" & p_status::"bit")) "idx_marital_matrix_single" btree ((status::"bit" & p_status::"bit")) "idx_marital_p_status" btree (p_status) "idx_marital_status" btree (status) Foreign-key constraints: "$1" FOREIGN KEY (member_id) REFERENCES members(member_id) ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED To give you an idea of the selectivity (NOTE: there are only 50,000 rows, a smaller sample than what I will actually be using): datingsite=> select count(*),status,p_status from marital_matrix group by status,p_status; count | status | p_status -------+--------+---------- 89 | 00001 | 00000 1319 | 00010 | 00000 2465 | 00100 | 00000 1 | 00100 | 11111 46117 | 10000 | 00000 here is the user I'll be comparing against, which has selected that he be matched with any but married people: datingsite=> SELECT * FROM marital_matrix WHERE member_id = 21; member_id | status | p_status -----------+--------+---------- 21 | 10000 | 11011 (1 row) Here are a few possible comparison methods I can think of (NOTE: tests were run on a 2.4Ghz Intel CPU w/ 256M RAM on FreeBSD 4.10: METHOD 1: Any bit that is set in p_status (prefered marital status) of the searching user should be set in the potential match's marital status. This is the method I'd like to improve, if possible. Running the query twice didn't produce a different runtime. EXPLAIN ANALYZE SELECT m2.member_id FROM marital_matrix m1, marital_matrix m2 WHERE m1.member_id = 21 AND m2.status & m1.p_status != B'00000'; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------------- Nested Loop (cost=0.00..2357.79 rows=49742 width=4) (actual time=18.062..708.469 rows=47525 loops=1) Join Filter: ((("inner".status)::"bit" & ("outer".p_status)::"bit") <> B'00000'::"bit") -> Index Scan using marital_matrix_pkey on marital_matrix m1 (cost=0.00..5.01 rows=1 width=9) (actual time=0.035..0.045 rows=1 loops=1) Index Cond: (member_id = 21) -> Seq Scan on marital_matrix m2 (cost=0.00..1602.91 rows=49991 width=13) (actual time=17.966..255.529 rows=49991 loops=1) Total runtime: 905.694 ms (6 rows) METHOD 2: Specifying the value (I don't think this would make a difference, but I'll post anyways): EXPLAIN ANALYZE SELECT member_id FROM marital_matrix WHERE status & B'11011' != B'00000'; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------- Seq Scan on marital_matrix (cost=0.00..1852.87 rows=49742 width=4) (actual time=18.113..281.101 rows=47525 loops=1) Filter: (((status)::"bit" & B'11011'::"bit") <> B'00000'::"bit") Total runtime: 480.836 ms (3 rows) METHOD 3: Checking for one bit only. This is definitely not a "real world" example and unacceptable since the p_status column can and will have multiple bits. For categories other than "Marital Status", such as "Prefered Hair Color", the users are likely to select multiple bits (they choose all that apply). This query does use the index, but is still not very fast at all: EXPLAIN ANALYZE SELECT member_id FROM marital_matrix m1 WHERE status & B'10000' = B'10000'; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------- Index Scan using idx_marital_matrix_single on marital_matrix m1 (cost=0.00..903.59 rows=250 width=4) (actual time=0.042..258.907 rows=46117 loops=1) Index Cond: (((status)::"bit" & B'10000'::"bit") = B'10000'::"bit") Total runtime: 451.162 ms (3 rows) METHOD 4: Using an IN statement. This method seems to be very fussy about using the index, and I have at some point made it use the index when there are less than 3 possibilites. Also, for fields other than Marital Status, users will be able to select many bits for their own profile, which means there would be many permutations: EXPLAIN ANALYZE SELECT member_id FROM marital_matrix WHERE status & B'11011' IN (B'10000',B'01000'); QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ Seq Scan on marital_matrix (cost=0.00..2602.73 rows=993 width=4) (actual time=17.845..288.279 rows=47525 loops=1) Filter: ((((status)::"bit" & B'11011'::"bit") = B'10000'::"bit") OR (((status)::"bit" & B'11011'::"bit") = B'01000'::"bit") OR (((status)::"bit" & B'11011'::"bit") = B'00010'::"bit") OR (((status)::"bit" & B'11011'::"bit") = B'00001'::"bit")) Total runtime: 488.651 ms (3 rows) Method 3 is the only one that used the index, but the only really acceptable method here is Method 1. My questions are... - Is there any hope in getting this to use an efficient index? - Any mathmaticians know if there is a way to reorder my bitwise comparison to have the operator use = and not an != (perhaps to force an index)? (AFAIK, the answer to the second question is no) If anyone could offer any performance tips here I'd really appreciate it. I imagine that having this schema wouldn't last an hour with the amount of CPU cycles it would be consuming on math operations. Also, I have read the thread that was posted here by Daniel in August: http://archives.postgresql.org/pgsql-performance/2004-08/msg00328.php I have spoke with Daniel on this issue and we both agree it's very difficult to find a solution that can scale to very large sites. I would very much appreciate any advice that some experienced users may have to offer me for such a situation. TIA Patrick
pgsql-performance by date: