Thread: CUBE_MAX_DIM
Hi,

Someone contacted me about increasing CUBE_MAX_DIM in
contrib/cube/cubedata.h (in the community RPMs). The current value is 100
with the following comment:

 * This limit is pretty arbitrary, but don't make it so large that you
 * risk overflow in sizing calculations.

They said they use 500, and never had a problem. I never added such
patches to the RPMS, and will not -- but wanted to ask if we can safely
increase it in upstream?

Regards,

--
Devrim Gündüz
Open Source Solution Architect, Red Hat Certified Engineer
Twitter: @DevrimGunduz , @DevrimGunduzTR
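[For context, the definition being discussed in contrib/cube/cubedata.h looks roughly like this; it is reconstructed here from the comment quoted above, so check the header for the exact wording.]

/*
 * This limit is pretty arbitrary, but don't make it so large that you
 * risk overflow in sizing calculations.
 */
#define CUBE_MAX_DIM (100)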
Hello,
The problem with higher-dimension cubes is that, starting at a dimensionality of around 52, the "distance" metrics computed in 64-bit floats have less than a single bit of mantissa per dimension, making cubes indistinguishable. Developers of facial recognition software had a chat about this on the Russian Postgres Telegram group, https://t.me/pgsql. Their problem was that they had 128-dimensional points and recompiled Postgres, but the distances weren't helpful and GiST KNN severely degraded to almost full scans. They had to reduce the number of facial features to make KNN search work.
Floating-point overflow isn't much of a risk per se; in the worst case the result becomes an Infinity or 0, which is usually acceptable in those contexts.
While higher-dimension cubes are mathematically possible, there are implementation issues with them. I'm ok with raising the limit if such nuances get a mention in the docs.
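To make the precision issue concrete, here is a contrived sketch in plain C (not the actual cube/GiST code; the 128 dimensions and the 1e-9 offset are just illustrative): a per-dimension difference of 1e-9 contributes only 1e-18 to a squared euclidean distance, which is rounded away once the running sum is near 1.0, so two distinct neighbours end up at exactly the same computed distance.

/* Contrived illustration, not the cube/GiST code: a per-dimension difference
 * of 1e-9 contributes only 1e-18 to a squared euclidean distance, which is
 * silently absorbed once the running sum is around 1.0 (a double has a 53-bit
 * significand, ~2.2e-16 ulp at 1.0). */
#include <stdio.h>

#define DIM 128

int main(void)
{
    double a[DIM] = {0}, b1[DIM] = {0}, b2[DIM] = {0};
    double d1 = 0.0, d2 = 0.0;

    b1[0] = 1.0;                    /* candidate 1: differs in one coordinate */
    b2[0] = 1.0;                    /* candidate 2: same, plus ...            */
    for (int i = 1; i < DIM; i++)
        b2[i] = 1e-9;               /* ... a small extra offset everywhere    */

    for (int i = 0; i < DIM; i++)
    {
        d1 += (a[i] - b1[i]) * (a[i] - b1[i]);
        d2 += (a[i] - b2[i]) * (a[i] - b2[i]);
    }

    /* Prints equal = 1: the 127 extra contributions were all rounded away,
     * so a KNN ordering cannot tell the two candidates apart. */
    printf("d1 = %.17g, d2 = %.17g, equal = %d\n", d1, d2, d1 == d2);
    return 0;
}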
On Thu, Jun 25, 2020 at 1:01 PM Devrim Gündüz <devrim@gunduz.org> wrote:
> Hi,
> Someone contacted me about increasing CUBE_MAX_DIM
> in contrib/cube/cubedata.h (in the community RPMs). The current value
> is 100 with the following comment:
> * This limit is pretty arbitrary, but don't make it so large that you
> * risk overflow in sizing calculations.
> They said they use 500, and never had a problem. I never added such patches to the RPMS, and will not -- but wanted to ask if we can safely increase it in upstream?
> Regards,
> --
> Devrim Gündüz
> Open Source Solution Architect, Red Hat Certified Engineer
> Twitter: @DevrimGunduz , @DevrimGunduzTR
Darafei Praliaskouski
Support me: http://patreon.com/komzpa
Devrim Gündüz <devrim@gunduz.org> writes:
> Someone contacted me about increasing CUBE_MAX_DIM
> in contrib/cube/cubedata.h (in the community RPMs). The current value
> is 100 with the following comment:

> * This limit is pretty arbitrary, but don't make it so large that you
> * risk overflow in sizing calculations.

> They said they use 500, and never had a problem.

I guess I'm wondering what's the use-case. 100 already seems an order of
magnitude more than anyone could want. Or, if it's not enough, why does
raising the limit just 5x enable any large set of new applications?

The practical issue here is that, since the data requires 16 bytes per
dimension (plus a little bit of overhead), we'd be talking about
increasing the maximum size of a cube field from ~ 1600 bytes to
~ 8000 bytes. And cube is not toastable, so that couldn't be compressed
or shoved out-of-line. Maybe your OP never had a problem with it, but
plenty of use-cases would have "tuple too large" failures due to not
having room on a heap page for whatever other data they want in the row.

Even a non-toastable 2KB field is going to give the tuple toaster
algorithm problems, as it'll end up shoving every other toastable field
out-of-line in an ultimately vain attempt to bring the tuple size below
2KB. So I'm really quite hesitant to raise CUBE_MAX_DIM much past where
it is now without any other changes.

A more credible proposal would be to make cube toast-aware and then
raise the limit to ~1GB ... but that would take a significant amount
of work, and we still haven't got a use-case justifying it.

I think I'd counsel storing such data as plain float8 arrays, which
do have the necessary storage infrastructure. Is there something
about the cube operators that's particularly missing?

			regards, tom lane
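[A back-of-envelope check of those sizes, as a rough sketch only: it assumes two float8 corners per dimension plus a small placeholder header, not the real struct layout from cubedata.h.]

/* Rough sizing sketch: 2 float8 corners per dimension (~16 bytes/dim) plus a
 * small header, compared against the ~2000-byte TOAST target and the
 * 8192-byte heap page. The header size is a placeholder, not the actual
 * NDBOX layout. */
#include <stdio.h>

int main(void)
{
    const int header = 16;              /* placeholder for varlena header + flags */
    const int dims[] = {100, 500};

    for (int i = 0; i < 2; i++)
    {
        int d = dims[i];
        int bytes = header + d * 2 * 8; /* lower + upper corner, 8 bytes each */
        printf("%3d dims -> ~%d bytes (TOAST target ~2000, heap page 8192)\n",
               d, bytes);
    }
    return 0;
}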
> Devrim Gündüz <devrim@gunduz.org> writes:
> > Someone contacted me about increasing CUBE_MAX_DIM
> > in contrib/cube/cubedata.h (in the community RPMs). The current value
> > is 100 with the following comment:
>
> > * This limit is pretty arbitrary, but don't make it so large that you
> > * risk overflow in sizing calculations.
>
> > They said they use 500, and never had a problem.
>
> I guess I'm wondering what's the use-case. 100 already seems an order of
> magnitude more than anyone could want. Or, if it's not enough, why does
> raising the limit just 5x enable any large set of new applications?

The dimensionality of embeddings generated by deep neural networks can be
high. Google BERT has 768 dimensions, for example. I know that cube in its
current form isn't suitable for nearest-neighbour searching these vectors
in their raw form (I have tried recompilation with a higher CUBE_MAX_DIM
myself), but conceptually kNN GiST searches using cubes can be useful for
these applications. There are other pre-processing techniques that can be
used to improve the speed of the search, but it still ends up with a kNN
search in a high-ish dimensional space.

> The practical issue here is that, since the data requires 16 bytes per
> dimension (plus a little bit of overhead), we'd be talking about
> increasing the maximum size of a cube field from ~ 1600 bytes to
> ~ 8000 bytes. And cube is not toastable, so that couldn't be compressed
> or shoved out-of-line. Maybe your OP never had a problem with it, but
> plenty of use-cases would have "tuple too large" failures due to not
> having room on a heap page for whatever other data they want in the row.
>
> Even a non-toastable 2KB field is going to give the tuple toaster
> algorithm problems, as it'll end up shoving every other toastable field
> out-of-line in an ultimately vain attempt to bring the tuple size below
> 2KB. So I'm really quite hesitant to raise CUBE_MAX_DIM much past where
> it is now without any other changes.
>
> A more credible proposal would be to make cube toast-aware and then
> raise the limit to ~1GB ... but that would take a significant amount
> of work, and we still haven't got a use-case justifying it.
>
> I think I'd counsel storing such data as plain float8 arrays, which
> do have the necessary storage infrastructure. Is there something
> about the cube operators that's particularly missing?

The indexable nearest-neighbour searches are one of the great cube
features not available with float8 arrays.

> regards, tom lane

Best regards,
Alastair
Alastair McKinley <a.mckinley@analyticsengines.com> writes:
> I know that cube in its current form isn't suitable for nearest-neighbour
> searching these vectors in their raw form (I have tried recompilation with
> a higher CUBE_MAX_DIM myself), but conceptually kNN GiST searches using
> cubes can be useful for these applications. There are other pre-processing
> techniques that can be used to improve the speed of the search, but it
> still ends up with a kNN search in a high-ish dimensional space.

Is there a way to fix the numerical instability involved? If we could do
that, then we'd definitely have a use-case justifying the work to make
cube toastable.

			regards, tom lane
> From: Tom Lane <tgl@sss.pgh.pa.us>
> Sent: 25 June 2020 17:43
>
> Alastair McKinley <a.mckinley@analyticsengines.com> writes:
> > I know that cube in its current form isn't suitable for nearest-neighbour
> > searching these vectors in their raw form (I have tried recompilation with
> > a higher CUBE_MAX_DIM myself), but conceptually kNN GiST searches using
> > cubes can be useful for these applications. There are other pre-processing
> > techniques that can be used to improve the speed of the search, but it
> > still ends up with a kNN search in a high-ish dimensional space.
>
> Is there a way to fix the numerical instability involved? If we could do
> that, then we'd definitely have a use-case justifying the work to make
> cube toastable.

I am not that familiar with the nature of the numerical instability, but it
might be worth noting for additional context that for the NN use case:

- The value of each dimension is likely to be between 0 and 1.
- The L1 distance is meaningful for high numbers of dimensions and
  *possibly* suffers less from the numeric issues than Euclidean distance.

Numerical stability isn't the only issue for high-dimensional kNN: the GiST
search performance currently degrades with increasing N towards sequential
scan performance, although maybe they are related?

> regards, tom lane

Best regards,
Alastair
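[As a rough illustration of the L1-vs-Euclidean point above, here is a contrived sketch with the same made-up 128-dimension setup as the earlier one, not the cube code: a per-dimension difference of 1e-9 survives an L1 accumulation, but its square vanishes in a squared-Euclidean sum near 1.0.]

/* Contrived sketch: with coordinates in [0,1], a per-dimension difference of
 * 1e-9 still shows up in an L1 (sum of absolute differences) accumulation,
 * while its square (1e-18) vanishes next to a squared-L2 running sum near 1. */
#include <stdio.h>
#include <math.h>

#define DIM 128

int main(void)
{
    double a[DIM] = {0}, b1[DIM] = {0}, b2[DIM] = {0};
    double l1_1 = 0, l1_2 = 0, l2sq_1 = 0, l2sq_2 = 0;

    b1[0] = 1.0;                    /* candidate 1: differs in one coordinate */
    b2[0] = 1.0;                    /* candidate 2: same, plus ...            */
    for (int i = 1; i < DIM; i++)
        b2[i] = 1e-9;               /* ... a small extra offset everywhere    */

    for (int i = 0; i < DIM; i++)
    {
        l1_1 += fabs(a[i] - b1[i]);
        l1_2 += fabs(a[i] - b2[i]);
        l2sq_1 += (a[i] - b1[i]) * (a[i] - b1[i]);
        l2sq_2 += (a[i] - b2[i]) * (a[i] - b2[i]);
    }

    /* L1 keeps the candidates apart; squared L2 reports them as equal. */
    printf("L1:   %.17g vs %.17g (equal = %d)\n", l1_1, l1_2, l1_1 == l1_2);
    printf("L2^2: %.17g vs %.17g (equal = %d)\n", l2sq_1, l2sq_2, l2sq_1 == l2sq_2);
    return 0;
}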