Thread: cast bytea to double precision[]
Hello,

I have a column of type bytea which contains a vector encoded as a string (the elements are separated by spaces). The problem is that it uses a lot of storage space.

I would like to convert this column to an ARRAY column with ALTER COLUMN TYPE. If I understand correctly, I have to specify a function to cast bytea to double precision[]. Is that correct? If so, how do I do that?

I use regexp_split_to_array to transform the bytea into a text[], but I can't figure out how to convert that to double precision[].

I'm using postgres 8.4. Any idea?

Thanks in advance,
Mathieu
I have found a solution by myself for the conversion:

  regexp_split_to_array(sig_vector, E',')::double precision[]

(the elements are in fact separated by commas, not spaces).

To convert my column I have used:

  ALTER TABLE my_table ALTER sig_vector TYPE double precision[]
    USING regexp_split_to_array(sig_vector, E',')::double precision[];

Is that correct? Is it correct to pass the column name to regexp_split_to_array()?

Mathieu

On 07/24/2011 08:18 PM, Mathieu Dubois wrote:
> I have a column of type bytea which contains a vector encoded as a
> string. The problem is that it uses a lot of storage space.
>
> I would like to convert this column to an ARRAY column with ALTER
> COLUMN TYPE.
> [...]
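Before rewriting the whole table, the cast expression can be previewed on a few rows. This is only a sketch: the explicit ::text cast is an assumption about how the bytea column is read, and my_table/sig_vector are simply the names used above.

  -- Preview the conversion on a handful of rows before running the ALTER.
  SELECT sig_vector,
         regexp_split_to_array(sig_vector::text, E',')::double precision[] AS converted
  FROM my_table
  LIMIT 5;

If the preview looks right, the ALTER TABLE above should produce the same values for every row.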
On Sun, Jul 24, 2011 at 2:03 PM, Mathieu Dubois <mathieu.dubois@limsi.fr> wrote:
> I have found a solution by myself for the conversion:
> regexp_split_to_array(sig_vector, E',')::double precision[]
> (the elements are in fact separated by commas).
>
> To convert my column I have used:
> ALTER TABLE my_table ALTER sig_vector TYPE double precision[]
>   USING regexp_split_to_array(sig_vector, E',')::double precision[];
>
> Is that correct?
> Is it correct to pass the column name to regexp_split_to_array()?

Yeah -- you are just passing a column's data into a function as an argument -- standard practice. This will work -- your bytea is really a text column, so it's just a matter of breaking up the string. The regexp_* functions are great for that.

merlin
On 07/25/2011 05:54 PM, Merlin Moncure wrote:
> Yeah -- you are just passing a column's data into a function as an
> argument -- standard practice. This will work -- your bytea is really
> a text column, so it's just a matter of breaking up the string.
> The regexp_* functions are great for that.

Thank you very much for your reply. I will launch the conversion right now.

Mathieu
Hello,

On 25/07/2011 17:58, Mathieu Dubois wrote:
> Thank you very much for your reply.
>
> I will launch the conversion right now.

The main reason to do this was to have smaller backups. The size of a compressed backup was around 1GB with bytea. I have converted the columns (on a copy of the database), but the expected gains are not there! With double precision[] it is still around 1GB (a little bit smaller, but only by a few MB).

The size on disk is not smaller either. I have listed the contents of /var/lib/postgres/8.4/main/base with du, and the 2 versions have the same size (3.1GB).

Does that make sense? My hypothesis is that the compression algorithm is able to find regularities in the data, so it finds the same regularities in bytea and in double precision[].

Is there any advantage to using double precision[] over bytea in my case?

Mathieu
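One way to check where the space actually goes is to ask the server rather than du. A minimal sketch, assuming the table is called my_table as above (pg_relation_size covers the main heap only, pg_total_relation_size includes the TOAST table and indexes; both exist in 8.4):

  -- Compare the heap alone with the total including TOAST and indexes.
  SELECT pg_size_pretty(pg_relation_size('my_table'))       AS heap_only,
         pg_size_pretty(pg_total_relation_size('my_table')) AS with_toast_and_indexes;

If most of the data lives in the TOAST table in both versions, similar on-disk and backup sizes would not be surprising.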
On Tue, Jul 26, 2011 at 2:45 AM, Mathieu Dubois <mathieu.dubois@limsi.fr> wrote:
> The main reason to do this was to have smaller backups. The size of a
> compressed backup was around 1GB with bytea. I have converted the
> columns (on a copy of the database), but the expected gains are not
> there! With double precision[] it is still around 1GB.
>
> The size on disk is not smaller either: the 2 versions have the same
> size (3.1GB).
>
> Is there any advantage to using double precision[] over bytea in my case?

Probably not -- arrays can be significantly smaller than a set of individual tuples each holding one value, because of the per-tuple overhead, but you still have to pay for the array header and a 4 byte length/null indicator per element.

A packed string is often the smallest way to store data, although not necessarily the best. A double precision[] comes with a lot of syntax advantages.

merlin
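A quick way to compare the two representations row by row is pg_column_size(). A sketch only, with the same assumption as before about reading the bytea as text; note that for a stored column pg_column_size reports the on-disk (possibly compressed) size, while for a computed value it reports the in-memory size, so this is a rough comparison:

  -- Rough per-row comparison of the packed string and the array form.
  SELECT pg_column_size(sig_vector) AS bytea_bytes,
         pg_column_size(regexp_split_to_array(sig_vector::text, E',')::double precision[]) AS array_bytes
  FROM my_table
  LIMIT 10;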
On 07/26/2011 04:30 PM, Merlin Moncure wrote:
> Probably not -- arrays can be significantly smaller than a set of
> individual tuples each holding one value, because of the per-tuple
> overhead, but you still have to pay for the array header and a 4 byte
> length/null indicator per element.
>
> A packed string is often the smallest way to store data, although not
> necessarily the best. A double precision[] comes with a lot of syntax
> advantages.

Thanks for your advice!

I find the result surprising because the floats are encoded with a lot of characters (something like 20) while a double is 8 bytes.

I have tried to run VACUUM but it changed nothing...

All of my code is based on strings, so I won't take the time to modify it if there is no gain.

Mathieu
On Tue, Jul 26, 2011 at 12:25 PM, Mathieu Dubois <mathieu.dubois@limsi.fr> wrote:
> Thanks for your advice!
>
> I find the result surprising because the floats are encoded with a lot
> of characters (something like 20) while a double is 8 bytes.
>
> I have tried to run VACUUM but it changed nothing...
>
> All of my code is based on strings, so I won't take the time to modify
> it if there is no gain.

Right. Remember that with each array element there is a 4 byte indicator (which holds the length and the null flag). So the array element size is 12 bytes, not 8, and you have the overhead of the array itself, which contains dimension information and the type oid. So it's about a wash versus a 15-digit-precision floating point representation plus a one byte delimiter.

If all the elements are known not null and the internally stored type is fixed length, it's hypothetically possible not to store the length/null indicator, but postgres does not do that, and it might not be worth it even if you wanted to.

merlin
On 07/26/2011 08:40 PM, Merlin Moncure wrote:
> Right. Remember that with each array element there is a 4 byte
> indicator (which holds the length and the null flag). So the array
> element size is 12 bytes, not 8, and you have the overhead of the
> array itself, which contains dimension information and the type oid.
> So it's about a wash versus a 15-digit-precision floating point
> representation plus a one byte delimiter.

Are you sure that the length of each element is stored? That seems like a waste of space (if you know that you are in a double precision[], you know that each element is 8 bytes -- or maybe arrays can contain elements of different types?).
Hmm, I forgot to say that I have several hundred or even thousands of elements, so saving even a few bytes per element could add up to gigabytes.

> If all the elements are known not null and the internally stored type
> is fixed length, it's hypothetically possible not to store the
> length/null indicator, but postgres does not do that, and it might not
> be worth it even if you wanted to.

I initially wanted to store my vector directly as it is laid out in memory. That is bad practice, since floats can differ from one machine to another even if they all use IEEE 754 (byte order, 32- or 64-bit machines). I remember having headaches trying to send floats over the network between old G4 Macs and PCs (the G4 is big-endian while Intel machines are little-endian). Moreover, I use different programming languages (ranging from Ada to Python), so text was the most common format. I know that Ada has ways to specify the memory layout of your objects, but I don't have a lot of experience with that (and this is probably not the place to ask). C often uses the closest type implemented in hardware, so in a mixed environment you have to deal with that too.

Maybe a simple trick is simply to store less data (removing trailing zeros) and/or to use a fixed relative precision.

Thank you again for your time. The postgres list is always helpful :)

Mathieu
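If the data stays in text form, one way to act on the "less data" idea is to re-serialise each vector with fewer decimals. A sketch only: the 6-decimal precision, the ::text cast and the comma separator are assumptions, and the element order coming out of unnest() is preserved in practice here though not formally guaranteed:

  -- Compare the current string length with a 6-decimal re-serialisation.
  SELECT octet_length(sig_vector) AS current_bytes,
         length(array_to_string(
                  ARRAY(SELECT round(x::numeric, 6)
                        FROM unnest(regexp_split_to_array(sig_vector::text, E',')::double precision[]) AS x),
                  ',')) AS rounded_chars
  FROM my_table
  LIMIT 5;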
On Tue, Jul 26, 2011 at 2:02 PM, Mathieu Dubois <mathieu.dubois@limsi.fr> wrote:
> Are you sure that the length of each element is stored?

No, I was wrong (I just checked the code) -- see CopyArrayEls. The null indicator is stored in a bitmap. Hm.

merlin
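An empirical way to see the per-element cost, without reading the source, is to measure arrays of increasing length. A small sketch (generate_series, random() and pg_column_size all exist in 8.4); the difference between consecutive rows is the real per-element cost:

  -- For not-null double precision elements the size should grow by
  -- 8 bytes per element on top of a fixed array header.
  SELECT n,
         pg_column_size(ARRAY(SELECT random()
                              FROM generate_series(1, n))) AS array_bytes
  FROM generate_series(1, 5) AS n;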
Hello,

This morning I conducted a simple test: I generated 10,000 samples of 100-dimensional data (uniformly distributed in [-10000; 10000]) and inserted them into 2 newly created databases with only 1 table each. In one case the table has a bytea column (directory 2102835), in the other a double precision[] column (directory 2102836). I generated scripts to insert the data in the 2 formats.

After insertion, 'du -h' gives:

  22M  ./2102835
  15M  ./2102836

So the double precision[] version really is smaller. I guess that in my earlier case (altering the column) there may be some kind of caching effect, or maybe it is the TOAST table... It's really puzzling.

Mathieu

On 07/26/2011 09:19 PM, Merlin Moncure wrote:
> On Tue, Jul 26, 2011 at 2:02 PM, Mathieu Dubois <mathieu.dubois@limsi.fr> wrote:
>> Are you sure that the length of each element is stored?
> No, I was wrong (I just checked the code) -- see CopyArrayEls. The null
> indicator is stored in a bitmap. Hm.
>
> merlin
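For reference, a sketch of how such test data can be generated directly in SQL (the table names, the comma separator and the value range are assumptions matching the description above; array_agg and generate_series exist in 8.4):

  -- 10,000 rows of 100 uniform values in [-10000, 10000], stored as an array.
  CREATE TABLE vectors_array (sig_vector double precision[]);
  INSERT INTO vectors_array
  SELECT array_agg(random() * 20000 - 10000)
  FROM generate_series(1, 10000) AS row_id,
       generate_series(1, 100)   AS elem
  GROUP BY row_id;

  -- The same kind of data serialised as a comma-separated string in a bytea column.
  CREATE TABLE vectors_bytea (sig_vector bytea);
  INSERT INTO vectors_bytea
  SELECT array_to_string(array_agg(random() * 20000 - 10000), ',')::bytea
  FROM generate_series(1, 10000) AS row_id,
       generate_series(1, 100)   AS elem
  GROUP BY row_id;

  -- Sizes including TOAST, without leaving SQL.
  SELECT pg_size_pretty(pg_total_relation_size('vectors_array')) AS array_size,
         pg_size_pretty(pg_total_relation_size('vectors_bytea')) AS bytea_size;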