Thread: On using doubles as primary keys
I have some data in the form of a matrix of doubles (~2 million
rows, ~400 columns) that I would like to store in a Pg table,
along with the associated table of metadata (same number of rows,
~30 columns, almost all text). This is large enough to make
working with it from flat files unwieldy.
(The only reason for wanting to transfer this data to a Pg table
is the hope that it will be easier to work with it by using SQL
queries to extract subsets of it, than by, e.g., writing programs
that must scan the entire matrix every time they're run.)
My plan is to add one extra column to the Pg tables to serve as
the primary key joining the data and the metadata tables
together.
One consideration complicating the choice of primary key
is wanting to have the ability to store chunks of the data
table (not the metadata table), including the PK column, as
matrices of doubles. In its simplest form, this would mean using
doubles as primary keys, but this seems to me a bit weird.
I'm willing to go ahead with this, but I wanted to ask for your
feedback on the whole thing. In particular I'd like to know if
there are standard ways of using doubles as primary keys,
and, conversely, if there are known pitfalls I should be looking
out for, but I welcome any other words of wisdom you may have on
this topic.
Thanks in advance!
kj
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

On Fri, Apr 17, 2015 at 11:34 AM, Kynn Jones <kynnjo@gmail.com> wrote:
> One consideration complicating the choice of primary key
> is wanting to have the ability to store chunks of the data
> table (not the metadata table), including the PK column, as
> matrices of doubles. In its simplest form, this would mean using
> doubles as primary keys, but this seems to me a bit weird.
> I'm willing to go ahead with this, but I wanted to ask for your
> feedback on the whole thing. In particular I'd like to know if
> there are standard ways of using doubles as primary keys,
> and, conversely, if there are known pitfalls I should be looking
> out for, but I welcome any other words of wisdom you may have on
> this topic.
MD>> I'm not sure what you mean by doubles. Do you mean the bigint data
MD>> type, or do you mean using two columns for a primary key? Either way
MD>> it's pretty simple.
MD>> If you mean a bigint, then it's probably best to use the serial data
MD>> type, which will default to the next value.
MD>> If you mean using two columns for a primary key, then you just
MD>> specify the columns.
MD>> E.g.: CONSTRAINT table_name_pk PRIMARY KEY (col1, col2)

I take it the OP means "double precision". I'd be suspicious of choosing a
PK whose type definition includes the word "inexact".

You also say you want to store the data as double precision, but you never
actually explain what the natural key of the data is.

If you are strictly storing serial (big)integers, but doing so within a
double precision typed column, you may be OK -- but I would advise taking
input from someone with more expertise on the properties of IEEE floating
point numbers.

David J.
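David's caveat about serial integers stored in a double precision column can be made concrete: an IEEE 754 double has a 53-bit significand, so integer keys round-trip exactly only up to 2^53. A quick Python sketch (Python floats are IEEE doubles, the same representation Postgres uses for double precision):

```python
# Integers up to 2**53 round-trip through a double exactly...
assert float(2**53) == 2**53
assert int(float(2**53 - 1)) == 2**53 - 1

# ...but beyond that, distinct integer keys collapse to the same double:
assert float(2**53 + 1) == float(2**53)
```

With ~2 million rows the OP is nowhere near that limit, but it is the boundary to keep in mind.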
(Not an IEEE floating point expert, but...) I've learned the hard way to
never rely on comparing two floating point numbers for equality -- and
that's what you are doing if you join on them as primary keys. If you
must use the underlying numeric data for joining, I'd recommend you do
something like:
* output the value to a string (rounded to a specific number of digits)
* use the textual representation of the number as your primary key.
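Ray's suggestion can be sketched in Python (the 12-digit precision and the helper name `float_key` are arbitrary illustrations; pick whatever precision matches your data):

```python
def float_key(x, digits=12):
    """Render a float as a fixed-precision string usable as a join key."""
    return f"{x:.{digits}g}"

a = 0.1 + 0.2   # 0.30000000000000004 as a raw double
b = 0.3

assert a != b                        # direct float equality fails
assert float_key(a) == float_key(b)  # textual keys compare equal
```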
On 4/17/15 1:10 PM, Ray Cote wrote:
> (Not an IEEE floating point expert, but...) I've learned the hard way to
> never rely on comparing two floating point numbers for equality -- and
> that's what you are doing if you join on them as primary keys. If you
> must use the underlying numeric data for joining, I'd recommend you do
> something like:
> * output the value to a string (rounded to a specific number of digits)
> * use the textual representation of the number as your primary key.

numeric would probably be more efficient, if you can use that instead. It
does suffer from some of the same issues as floating point (there's been
recent discussion of that on -hackers), but I believe it'd be better than
double.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
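The exactness Jim is pointing at can be seen with Python's stdlib decimal module, which, like Postgres numeric, stores base-10 digits exactly (a sketch in Python for illustration, not Postgres itself):

```python
from decimal import Decimal

# Binary doubles cannot represent 0.1 or 0.2 exactly, so sums drift:
assert 0.1 + 0.2 != 0.3

# Decimal arithmetic (like Postgres numeric) is exact in base 10,
# so the same sum compares equal:
assert Decimal("0.1") + Decimal("0.2") == Decimal("0.3")
```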
On Apr 17, 2015 8:35 AM, "Kynn Jones" <kynnjo@gmail.com> wrote:
> (The only reason for wanting to transfer this data to a Pg table
> is the hope that it will be easier to work with it by using SQL
800 million 8-byte numbers doesn't seem totally unreasonable for python/R/Matlab, if you have a lot of memory. Are you sure you want it in Postgres? Load the file once then filter it as you like. If you don't have the memory I can see how using Postgres to get fewer rows at a time might help. Fewer columns at a time would help even more if that's possible.
> In its simplest form, this would mean using
> doubles as primary keys, but this seems to me a bit weird.
I'd avoid that and just include an integer PK with your data. Data frames in the languages above support that, or just slice off the PK column before doing your matrix math.
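For example, if each fetched row carries its key as the first element, the slice is one line (plain Python for illustration; numpy/R slicing is analogous):

```python
# Hypothetical chunk fetched from the data table: first column is the PK.
rows = [
    [1.0, 3.5, 2.2],
    [2.0, 0.7, 9.1],
]

pks = [int(r[0]) for r in rows]   # keep the keys as integers
matrix = [r[1:] for r in rows]    # pure numeric matrix for the math

assert pks == [1, 2]
assert matrix == [[3.5, 2.2], [0.7, 9.1]]
```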
Also instead of 401 columns per row maybe store all 400 doubles in an array column? Not sure if that's useful for you but maybe it's worth considering.
Also if you put the metadata in the same table as the doubles, can you leave off the PKs altogether? Why join if you don't have to? It sounds like the tables are 1-to-1? Even if some metadata is not, maybe you can finesse it with hstore/arrays.
Good luck!
Paul