Thread: cast bytea to double precision[]
Hello,

I have a column of type bytea which contains a vector encoded as a string (the elements are separated by spaces). The problem is that it uses a lot of storage space.

I would like to convert this column to an ARRAY column with ALTER COLUMN TYPE. If I understand correctly, I have to specify a function to cast bytea to double precision[]. Is that correct? If so, how do I do that?

I use regexp_split_to_array to transform the bytea into a text[], but I can't figure out how to convert that to double precision[].

I'm using postgres 8.4. Any idea?

Thanks in advance,
Mathieu
I have found a solution by myself for the conversion:

  regexp_split_to_array(sig_vector, E',')::double precision[]

(the elements are in fact separated by commas, not spaces).

To convert my column I have used:

  ALTER TABLE my_table ALTER sig_vector TYPE double precision[]
    USING regexp_split_to_array(sig_vector, E',')::double precision[];

Is that correct? Is it correct to pass the column name to regexp_split_to_array()?

Mathieu

On 07/24/2011 08:18 PM, Mathieu Dubois wrote:
> I have a column of type bytea which contains a vector encoded as a
> string. The problem is that it uses a lot of storage space.
>
> I would like to convert this column to an ARRAY column with ALTER
> COLUMN TYPE.
> [...]
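Before rewriting the whole table, the cast expression can be previewed on a few rows. This is only a sketch: the explicit ::text cast is an assumption about how the bytea column is read, and my_table/sig_vector are simply the names used above.

  -- Preview the conversion on a handful of rows before running the ALTER.
  SELECT sig_vector,
         regexp_split_to_array(sig_vector::text, E',')::double precision[] AS converted
  FROM my_table
  LIMIT 5;

If the preview looks right, the ALTER TABLE above should produce the same values for every row.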
On Sun, Jul 24, 2011 at 2:03 PM, Mathieu Dubois <mathieu.dubois@limsi.fr> wrote:
> I have found a solution by myself for the conversion:
> regexp_split_to_array(sig_vector, E',')::double precision[]
> (the elements are in fact separated by commas).
>
> To convert my column I have used:
> ALTER TABLE my_table ALTER sig_vector TYPE double precision[]
>   USING regexp_split_to_array(sig_vector, E',')::double precision[];
>
> Is that correct?
> Is it correct to pass the column name to regexp_split_to_array()?

Yeah -- you are just passing a column's data into a function as an argument -- standard practice. This will work -- your bytea is really a text column, so it's just a matter of breaking up the string. The regexp_* functions are great for that.

merlin
On 07/25/2011 05:54 PM, Merlin Moncure wrote:
> Yeah -- you are just passing a column's data into a function as an
> argument -- standard practice. This will work -- your bytea is really
> a text column, so it's just a matter of breaking up the string.
> The regexp_* functions are great for that.

Thank you very much for your reply. I will launch the conversion right now.

Mathieu
Hello,

On 25/07/2011 17:58, Mathieu Dubois wrote:
> Thank you very much for your reply.
>
> I will launch the conversion right now.

The main reason to do this was to have smaller backups. The size of a compressed backup was around 1GB with bytea. I have converted the columns (on a copy of the database), but the expected gains are not there! With double precision[] it is still around 1GB (a little bit smaller, but only by a few MB).

The size on disk is not smaller either. I have listed the contents of /var/lib/postgres/8.4/main/base with du, and the 2 versions have the same size (3.1GB).

Does that make sense? My hypothesis is that the compression algorithm is able to find regularities in the data, so it finds the same regularities in bytea and in double precision[].

Is there any advantage to using double precision[] over bytea in my case?

Mathieu
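One way to check where the space actually goes is to ask the server rather than du. A minimal sketch, assuming the table is called my_table as above (pg_relation_size covers the main heap only, pg_total_relation_size includes the TOAST table and indexes; both exist in 8.4):

  -- Compare the heap alone with the total including TOAST and indexes.
  SELECT pg_size_pretty(pg_relation_size('my_table'))       AS heap_only,
         pg_size_pretty(pg_total_relation_size('my_table')) AS with_toast_and_indexes;

If most of the data lives in the TOAST table in both versions, similar on-disk and backup sizes would not be surprising.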
On Tue, Jul 26, 2011 at 2:45 AM, Mathieu Dubois <mathieu.dubois@limsi.fr> wrote:
> The main reason to do this was to have smaller backups. The size of a
> compressed backup was around 1GB with bytea. I have converted the
> columns (on a copy of the database), but the expected gains are not
> there! With double precision[] it is still around 1GB.
>
> The size on disk is not smaller either: the 2 versions have the same
> size (3.1GB).
>
> Is there any advantage to using double precision[] over bytea in my case?

Probably not -- arrays can be significantly smaller than a set of individual tuples each holding one value, because of the per-tuple overhead, but you still have to pay for the array header and a 4 byte length/null indicator per element.

A packed string is often the smallest way to store data, although not necessarily the best. A double precision[] comes with a lot of syntax advantages.

merlin
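A quick way to compare the two representations row by row is pg_column_size(). A sketch only, with the same assumption as before about reading the bytea as text; note that for a stored column pg_column_size reports the on-disk (possibly compressed) size, while for a computed value it reports the in-memory size, so this is a rough comparison:

  -- Rough per-row comparison of the packed string and the array form.
  SELECT pg_column_size(sig_vector) AS bytea_bytes,
         pg_column_size(regexp_split_to_array(sig_vector::text, E',')::double precision[]) AS array_bytes
  FROM my_table
  LIMIT 10;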
On 07/26/2011 04:30 PM, Merlin Moncure wrote:
> Probably not -- arrays can be significantly smaller than a set of
> individual tuples each holding one value, because of the per-tuple
> overhead, but you still have to pay for the array header and a 4 byte
> length/null indicator per element.
>
> A packed string is often the smallest way to store data, although not
> necessarily the best. A double precision[] comes with a lot of syntax
> advantages.

Thanks for your advice!

I find the result surprising because the floats are encoded with a lot of characters (something like 20) while a double is 8 bytes.

I have tried to run VACUUM but it changed nothing...

All of my code is based on strings, so I won't take the time to modify it if there is no gain.

Mathieu
On Tue, Jul 26, 2011 at 12:25 PM, Mathieu Dubois <mathieu.dubois@limsi.fr> wrote:
> Thanks for your advice!
>
> I find the result surprising because the floats are encoded with a lot
> of characters (something like 20) while a double is 8 bytes.
>
> I have tried to run VACUUM but it changed nothing...
>
> All of my code is based on strings, so I won't take the time to modify
> it if there is no gain.

Right. Remember that with each array element there is a 4 byte indicator (which holds the length and the null flag). So the array element size is 12 bytes, not 8, and you have the overhead of the array itself, which contains dimension information and the type oid. So it's about a wash versus a 15-digit-precision floating point representation plus a one byte delimiter.

If all the elements are known not null and the internally stored type is fixed length, it's hypothetically possible not to store the length/null indicator, but postgres does not do that, and it might not be worth it even if you wanted to.

merlin
On 07/26/2011 08:40 PM, Merlin Moncure wrote:
> Right. Remember that with each array element there is a 4 byte
> indicator (which holds the length and the null flag). So the array
> element size is 12 bytes, not 8, and you have the overhead of the
> array itself, which contains dimension information and the type oid.
> So it's about a wash versus a 15-digit-precision floating point
> representation plus a one byte delimiter.

Are you sure that the length of each element is stored? That seems like a waste of space (if you know that you are in a double precision[], you know that each element is 8 bytes -- or maybe arrays can contain elements of different types?).
Hmm, I forgot to say that I have several hundred or even thousands of elements, so saving even a few bytes per element could add up to gigabytes.

> If all the elements are known not null and the internally stored type
> is fixed length, it's hypothetically possible not to store the
> length/null indicator, but postgres does not do that, and it might not
> be worth it even if you wanted to.

I initially wanted to store my vector directly as it is laid out in memory. That is bad practice, since floats can differ from one machine to another even if they all use IEEE 754 (byte order, 32- or 64-bit machines). I remember having headaches trying to send floats over the network between old G4 Macs and PCs (the G4 is big-endian while Intel machines are little-endian). Moreover, I use different programming languages (ranging from Ada to Python), so text was the most common format. I know that Ada has ways to specify the memory layout of your objects, but I don't have a lot of experience with that (and this is probably not the place to ask). C often uses the closest type implemented in hardware, so in a mixed environment you have to deal with that too.

Maybe a simple trick is simply to store less data (removing trailing zeros) and/or to use a fixed relative precision.

Thank you again for your time. The postgres list is always helpful :)

Mathieu
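If the data stays in text form, one way to act on the "less data" idea is to re-serialise each vector with fewer decimals. A sketch only: the 6-decimal precision, the ::text cast and the comma separator are assumptions, and the element order coming out of unnest() is preserved in practice here though not formally guaranteed:

  -- Compare the current string length with a 6-decimal re-serialisation.
  SELECT octet_length(sig_vector) AS current_bytes,
         length(array_to_string(
                  ARRAY(SELECT round(x::numeric, 6)
                        FROM unnest(regexp_split_to_array(sig_vector::text, E',')::double precision[]) AS x),
                  ',')) AS rounded_chars
  FROM my_table
  LIMIT 5;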
On Tue, Jul 26, 2011 at 2:02 PM, Mathieu Dubois <mathieu.dubois@limsi.fr> wrote:
> Are you sure that the length of each element is stored?

No, I was wrong (I just checked the code) -- see CopyArrayEls. The null indicator is stored in a bitmap. Hm.

merlin
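An empirical way to see the per-element cost, without reading the source, is to measure arrays of increasing length. A small sketch (generate_series, random() and pg_column_size all exist in 8.4); the difference between consecutive rows is the real per-element cost:

  -- For not-null double precision elements the size should grow by
  -- 8 bytes per element on top of a fixed array header.
  SELECT n,
         pg_column_size(ARRAY(SELECT random()
                              FROM generate_series(1, n))) AS array_bytes
  FROM generate_series(1, 5) AS n;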
Hello,

This morning I conducted a simple test: I generated 10,000 samples of 100-dimensional data (uniformly distributed in [-10000; 10000]) and inserted them into 2 newly created databases with only 1 table each. In one case the table has a bytea column (directory 2102835), in the other a double precision[] column (directory 2102836). I generated scripts to insert the data in the 2 formats.

After insertion, 'du -h' gives:

  22M  ./2102835
  15M  ./2102836

So the double precision[] version really is smaller. I guess that in my earlier case (altering the column) there may be some kind of caching effect, or maybe it is the TOAST table... It's really puzzling.

Mathieu

On 07/26/2011 09:19 PM, Merlin Moncure wrote:
> On Tue, Jul 26, 2011 at 2:02 PM, Mathieu Dubois <mathieu.dubois@limsi.fr> wrote:
>> Are you sure that the length of each element is stored?
> No, I was wrong (I just checked the code) -- see CopyArrayEls. The null
> indicator is stored in a bitmap. Hm.
>
> merlin
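For reference, a sketch of how such test data can be generated directly in SQL (the table names, the comma separator and the value range are assumptions matching the description above; array_agg and generate_series exist in 8.4):

  -- 10,000 rows of 100 uniform values in [-10000, 10000], stored as an array.
  CREATE TABLE vectors_array (sig_vector double precision[]);
  INSERT INTO vectors_array
  SELECT array_agg(random() * 20000 - 10000)
  FROM generate_series(1, 10000) AS row_id,
       generate_series(1, 100)   AS elem
  GROUP BY row_id;

  -- The same kind of data serialised as a comma-separated string in a bytea column.
  CREATE TABLE vectors_bytea (sig_vector bytea);
  INSERT INTO vectors_bytea
  SELECT array_to_string(array_agg(random() * 20000 - 10000), ',')::bytea
  FROM generate_series(1, 10000) AS row_id,
       generate_series(1, 100)   AS elem
  GROUP BY row_id;

  -- Sizes including TOAST, without leaving SQL.
  SELECT pg_size_pretty(pg_total_relation_size('vectors_array')) AS array_size,
         pg_size_pretty(pg_total_relation_size('vectors_bytea')) AS bytea_size;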