Re: Do we want a hashset type? - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: Do we want a hashset type? |
Date | |
Msg-id | 24e60a0e-b067-7f74-9f48-09f7d056144e@enterprisedb.com Whole thread Raw |
In response to | Re: Do we want a hashset type? ("Joel Jacobson" <joel@compiler.org>) |
Responses |
Re: Do we want a hashset type?
|
List | pgsql-hackers |
On 6/19/23 13:33, Joel Jacobson wrote: > On Mon, Jun 19, 2023, at 11:21, Tomas Vondra wrote: >> AFAICS the standard only defines arrays and multisets. Arrays are pretty >> much the thing we have, including the ARRAY[] constructor etc. Multisets >> are similar to hashset discussed here, except that it tracks the number >> of elements for each value (which would be trivial in hashset). >> >> So if we want to make this a built-in feature, maybe we should aim to do >> the multiset thing, with the standard SQL syntax? Extending the grammar >> should not be hard, I think. I'm not sure of the underlying code >> (ArrayType, ARRAY_SUBLINK stuff, etc.) we could reuse or if we'd need a >> lot of separate code doing that. > > Multisets handle duplicates uniquely, this may bring unexpected issues. Sets > and multisets have distinct utility in C++, Rust, Java, etc. However, sets are > more fundamental and prevalent in std libs than multisets. > What unexpected issues you mean? Sure, if someone uses multisets as if they were sets (so ignoring the handling of duplicates), things will go booom! quickly. I imagined (if we ended up doing MULTISET) we'd provide interface (e.g. operators) that'd allow perhaps help with this. > Despite SQL's multiset possibility, a distinct hashset type is my preference, > helping appropriate data structure choice and reducing misuse. > > The necessity of multisets is vague beyond standards compliance. > True - we haven't had any requests/proposal to implement MULTISETs. I've looked at the SQL standard primarily to check if maybe there's some precedent that'd give us guidance on the SQL syntax etc. And I think multisets are that - even if we end up not implementing them, it'd be sad to have unnecessarily inconsistent syntax (in case someone decides to add multisets in the future). We could invent "SET" data type, so while standard has ARRAY / MULTISET, we'd have ARRAY / MULTISET / SET, and the difference between the last two would be just handling of duplicates. The other way to look at sets is that they are pretty similar to arrays, except that there are no duplicates and order does not matter. Sure, the on-disk format and code is different, but from the SQL perspective it'd be nice to allow using sets in most places where arrays are allowed (which is what the standard does for MULTISETS, more or less). That'd mean we could probably search through gram.y for places working with arrays ("ARRAY array_expr", "ARRAY select_with_parens", ...) and make them work with sets too, say by having SET_SUBLINK instead of ARRAY_SUBLINK, set_expression instead of array_expression, etc. This might be also "consistent" with defining hashset type using CREATE TYPE with ELEMENT, because we consider the type to be "array". So that would be polymorphic type, but we don't have pre-defined array for every type (and I'm not sure we want to). Of course, maybe there's some fatal flaw in these idea, I don't know. And I don't want to move the goalposts too far - but it seems like this might make some stuff actually simpler to implement (by piggy-backing on the existing array infrastructure). A mostly unrelated thought - I wonder if this might be somehow related to the foreign key array patch ([1] might be the most recent attempt in this direction). Not to hashset itself, but I recalled these patches because it'd mean we don't need the separate "edges" link table (so the hashset column would be the think backing the FK). [1] https://www.postgresql.org/message-id/CAJvoCut7zELHnBSC8HrM6p-R6q-NiBN1STKhqnK5fPE-9%3DGq3g%40mail.gmail.com regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
pgsql-hackers by date: