Thread: On using doubles as primary keys
I have some data in the form of a matrix of doubles (~2 million
rows, ~400 columns) that I would like to store in a Pg table,
along with the associated table of metadata (same number of rows,
~30 columns, almost all text). This is large enough to make
working with it from flat files unwieldy.
(The only reason for wanting to transfer this data to a Pg table
is the hope that it will be easier to work with it by using SQL
queries to extract subsets of it, than by, e.g., writing programs
that must scan the entire matrix every time they're run.)
My plan is to add one extra column to the Pg tables to serve as
the primary key joining the data and the metadata tables
together.
One consideration complicating the choice of primary key
is wanting to have the ability to store chunks of the data
table (not the metadata table), including the PK column, as
matrices of doubles. In its simplest form, this would mean using
doubles as primary keys, but this seems to me a bit weird.
I'm willing to go ahead with this, but I wanted to ask for your
feedback on the whole thing. In particular I'd like to know if
there are standard ways of using doubles as primary keys,
and, conversely, if there are known pitfalls I should be looking
out for, but I welcome any other words of wisdom you may have on
this topic.
Thanks in advance!
kj
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

On Fri, Apr 17, 2015 at 11:34 AM, Kynn Jones <kynnjo@gmail.com> wrote:
> One consideration complicating the choice of primary key
> is wanting to have the ability to store chunks of the data
> table (not the metadata table), including the PK column, as
> matrices of doubles. In its simplest form, this would mean using
> doubles as primary keys, but this seems to me a bit weird.
> I'm willing to go ahead with this, but I wanted to ask for your
> feedback on the whole thing. In particular I'd like to know if
> there are standard ways of using doubles as primary keys,
> and, conversely, if there are known pitfalls I should be looking
> out for, but I welcome any other words of wisdom you may have on
> this topic.
MD>> I'm not sure what you mean by doubles. Do you mean the bigint data
MD>> type, or do you mean using two columns for a primary key? Either way
MD>> it's pretty simple.
MD>> If you mean a bigint, then it's probably best to use the serial data
MD>> type, which will default to the next value.
MD>> If you mean using two columns for a primary key, then you just
MD>> specify the columns.
MD>> E.g.: CONSTRAINT table_name_pk PRIMARY KEY (col1, col2)

I take it the OP means "double precision". I'd be suspicious of choosing a
PK whose type definition includes the word "inexact".

You also say you want to store the data as double precision, but you never
actually explain what the natural key of the data is.

If you are strictly storing serial (big)integers, but doing so within a
double precision typed column, you may be OK -- but I would advise taking
input from someone with more expertise on the properties of IEEE floating
point numbers.

David J.
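David's caveat about serial integers stored in a double precision column can be made concrete: an IEEE 754 double has a 53-bit significand, so integer keys round-trip exactly only up to 2^53. A quick Python sketch (Python floats are IEEE doubles, the same representation Postgres uses for double precision):

```python
# Integers up to 2**53 round-trip through a double exactly...
assert float(2**53) == 2**53
assert int(float(2**53 - 1)) == 2**53 - 1

# ...but beyond that, distinct integer keys collapse to the same double:
assert float(2**53 + 1) == float(2**53)
```

With ~2 million rows the OP is nowhere near that limit, but it is the boundary to keep in mind.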
(Not an IEEE floating point expert, but...) I've learned the hard way to
never rely on comparing two floating point numbers for equality -- and
that's what you are doing if you join on them as primary keys. If you
must use the underlying numeric data for joining, I'd recommend you do
something like:
* output the value to a string (rounded to a specific number of digits)
* use the textual representation of the number as your primary key.
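Ray's suggestion can be sketched in Python (the 12-digit precision and the helper name `float_key` are arbitrary illustrations; pick whatever precision matches your data):

```python
def float_key(x, digits=12):
    """Render a float as a fixed-precision string usable as a join key."""
    return f"{x:.{digits}g}"

a = 0.1 + 0.2   # 0.30000000000000004 as a raw double
b = 0.3

assert a != b                        # direct float equality fails
assert float_key(a) == float_key(b)  # textual keys compare equal
```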
On 4/17/15 1:10 PM, Ray Cote wrote:
> (Not an IEEE floating point expert, but...) I've learned the hard way to
> never rely on comparing two floating point numbers for equality -- and
> that's what you are doing if you join on them as primary keys. If you
> must use the underlying numeric data for joining, I'd recommend you do
> something like:
> * output the value to a string (rounded to a specific number of digits)
> * use the textual representation of the number as your primary key.

numeric would probably be more efficient, if you can use that instead. It
does suffer from some of the same issues as floating point (there's been
recent discussion of that on -hackers), but I believe it'd be better than
double.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
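The exactness Jim is pointing at can be seen with Python's stdlib decimal module, which, like Postgres numeric, stores base-10 digits exactly (a sketch in Python for illustration, not Postgres itself):

```python
from decimal import Decimal

# Binary doubles cannot represent 0.1 or 0.2 exactly, so sums drift:
assert 0.1 + 0.2 != 0.3

# Decimal arithmetic (like Postgres numeric) is exact in base 10,
# so the same sum compares equal:
assert Decimal("0.1") + Decimal("0.2") == Decimal("0.3")
```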
On Apr 17, 2015 8:35 AM, "Kynn Jones" <kynnjo@gmail.com> wrote:
> (The only reason for wanting to transfer this data to a Pg table
> is the hope that it will be easier to work with it by using SQL
800 million 8-byte numbers doesn't seem totally unreasonable for python/R/Matlab, if you have a lot of memory. Are you sure you want it in Postgres? Load the file once then filter it as you like. If you don't have the memory I can see how using Postgres to get fewer rows at a time might help. Fewer columns at a time would help even more if that's possible.
> In its simplest form, this would mean using
> doubles as primary keys, but this seems to me a bit weird.
I'd avoid that and just include an integer PK with your data. Data frames in the languages above support that, or just slice off the PK column before doing your matrix math.
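For example, if each fetched row carries its key as the first element, the slice is one line (plain Python for illustration; numpy/R slicing is analogous):

```python
# Hypothetical chunk fetched from the data table: first column is the PK.
rows = [
    [1.0, 3.5, 2.2],
    [2.0, 0.7, 9.1],
]

pks = [int(r[0]) for r in rows]   # keep the keys as integers
matrix = [r[1:] for r in rows]    # pure numeric matrix for the math

assert pks == [1, 2]
assert matrix == [[3.5, 2.2], [0.7, 9.1]]
```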
Also instead of 401 columns per row maybe store all 400 doubles in an array column? Not sure if that's useful for you but maybe it's worth considering.
Also if you put the metadata in the same table as the doubles, can you leave off the PKs altogether? Why join if you don't have to? It sounds like the tables are 1-to-1? Even if some metadata is not, maybe you can finesse it with hstore/arrays.
Good luck!
Paul