Thread: [GENERAL] "Shared strings"-style table
hey, In the spreadsheet world, there is this concept of "shared strings," a simple way of compressing spreadsheets when the data is duplicated in many cells. In my database, I have a table with >200 million rows and >300 columns (all the households in the United States). For clarity of development and debugging, I have not made any effort to normalize its contents, so millions of rows have, for example, "SINGLE FAMILY RESIDENCE / TOWNHOUSE" (yes, that whole string!) instead of some code representing it. Theoretically / blue sky, could there be a table or column type that transparently handles "shared strings" like this, reducing size on disk at the cost of lookup overhead for all queries? (I guess maybe it's like TOAST, but content-hashed and de-duped and not only for large objects?) Thanks, Seamus -- Seamus Abshere, SCEA https://www.faraday.io https://github.com/seamusabshere https://linkedin.com/in/seamusabshere -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
On 10/13/2017 09:49 AM, Seamus Abshere wrote: > hey, > > In the spreadsheet world, there is this concept of "shared strings," a > simple way of compressing spreadsheets when the data is duplicated in > many cells. > > In my database, I have a table with >200 million rows and >300 columns > (all the households in the United States). For clarity of development > and debugging, I have not made any effort to normalize its contents, so > millions of rows have, for example, "SINGLE FAMILY RESIDENCE / > TOWNHOUSE" (yes, that whole string!) instead of some code representing > it. > > Theoretically / blue sky, could there be a table or column type that > transparently handles "shared strings" like this, reducing size on disk > at the cost of lookup overhead for all queries? > > (I guess maybe it's like TOAST, but content-hashed and de-duped and not > only for large objects?) > > Thanks, > Seamus > > -- > Seamus Abshere, SCEA > https://www.faraday.io > https://github.com/seamusabshere > https://linkedin.com/in/seamusabshere > > What data type are these columns now? I would be tempted to map the full strings to an abbreviation just so I didn't have to alter all the columns to an "id"; Optional to place any RI on the columns to the abbreviation dictionary table. Just use the translation as a last step in user facing reports. If you can map/abbreviate to 4 characters, you've approximated the disk size of an integer. -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Theoretically / blue sky, could there be a table or column type that
transparently handles "shared strings" like this, reducing size on disk
at the cost of lookup overhead for all queries?
(I guess maybe it's like TOAST, but content-hashed and de-duped and not
only for large objects?)
Row-independence is baked into PostgreSQL pretty deeply...
I think an enum type is about as close are you are likely to get if you don't wish to setup your own foreign-key relationships with surrogate keys.
David J.
> On Fri, Oct 13, 2017 at 8:49 AM, Seamus Abshere wrote > > Theoretically / blue sky, could there be a table or column type that > > transparently handles "shared strings" like this, reducing size on disk > > at the cost of lookup overhead for all queries? > > (I guess maybe it's like TOAST, but content-hashed and de-duped and not > > only for large objects?) On Fri, Oct 13, 2017, at 01:12 PM, David G. Johnston wrote: > Row-independence is baked into PostgreSQL pretty deeply... Could you say more about that? What about the comparison to TOAST, which stores values off-table? Thanks! -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
On Fri, Oct 13, 2017 at 12:12 PM, David G. Johnston <david.g.johnston@gmail.com> wrote:
Theoretically / blue sky, could there be a table or column type that
transparently handles "shared strings" like this, reducing size on disk
at the cost of lookup overhead for all queries?
(I guess maybe it's like TOAST, but content-hashed and de-duped and not
only for large objects?)Row-independence is baked into PostgreSQL pretty deeply...I think an enum type is about as close are you are likely to get if you don't wish to setup your own foreign-key relationships with surrogate keys.David J.
I STRONGLY advise againt the use of ENUMS.
What was described is exactly what relations and Foreign Keys are for.
Example:
CREATE TABLE residence_type
(
residence_type_id INTEGER NOT NULL,
residence_type_desc TEXT NOT NULL,
CONSTRAINT residence_type_pk PRIMARY KEY (residence_type_id)
);
CREATE TABLE state
(
state_id CHAR(02) NOT NULL,
state_name TEXT NOT NULL,
CONSTRAINT state_pk PRIMARY KEY (state_id)
);
CREATE TABLE residence
(
residence_id BIGINT NOT NULL,
residence_type_id INTEGER,
street_num CHARACTER(10),
street_name CHARACTER(20),
city CHARACTER(40),
state_id CHAR(02),
CONSTRAINT residence_pk PRIMARY KEY (residence_id),
CONSTRAINT fk_residence_state FOREIGN KEY (state_id)
REFERENCES state (state_id),
CONSTRAINT fk_residence_type FOREIGN KEY (residence_type_id)
REFERENCES residence_type (residence_type_id)
);
SELECT t.residence_type_desc,
r.street_num,
r.street_name,
r.city,
s.state_name
FROM residence r
JOIN residence_type t ON t.residence_id = r.residence_id
JOIN state s ON s.state_id = r.state_id
WHERE residence_id = 12345;
Example:
CREATE TABLE residence_type
(
residence_type_id INTEGER NOT NULL,
residence_type_desc TEXT NOT NULL,
CONSTRAINT residence_type_pk PRIMARY KEY (residence_type_id)
);
CREATE TABLE state
(
state_id CHAR(02) NOT NULL,
state_name TEXT NOT NULL,
CONSTRAINT state_pk PRIMARY KEY (state_id)
);
CREATE TABLE residence
(
residence_id BIGINT NOT NULL,
residence_type_id INTEGER,
street_num CHARACTER(10),
street_name CHARACTER(20),
city CHARACTER(40),
state_id CHAR(02),
CONSTRAINT residence_pk PRIMARY KEY (residence_id),
CONSTRAINT fk_residence_state FOREIGN KEY (state_id)
REFERENCES state (state_id),
CONSTRAINT fk_residence_type FOREIGN KEY (residence_type_id)
REFERENCES residence_type (residence_type_id)
);
SELECT t.residence_type_desc,
r.street_num,
r.street_name,
r.city,
s.state_name
FROM residence r
JOIN residence_type t ON t.residence_id = r.residence_id
JOIN state s ON s.state_id = r.state_id
WHERE residence_id = 12345;
--
Melvin Davidson
I reserve the right to fantasize. Whether or not you
wish to share my fantasy is entirely up to you.
I reserve the right to fantasize. Whether or not you
wish to share my fantasy is entirely up to you.

> > On Fri, Oct 13, 2017 at 8:49 AM, Seamus Abshere wrote: > >> Theoretically / blue sky, could there be a table or column type that > >> transparently handles "shared strings" like this, reducing size on disk > >> at the cost of lookup overhead for all queries? > >> (I guess maybe it's like TOAST, but content-hashed and de-duped and not > >> only for large objects?) On Fri, Oct 13, 2017, at 01:29 PM, Melvin Davidson wrote: > What was described is exactly what relations and Foreign Keys are for. hi Melvin, appreciate the reminder. Our issue is that we have 300+ columns and frequently include them in the SELECT or WHERE clauses... so traditional normalization would involve hundreds of joins. That's why I ask about a new table or column type that handles basic translation and de-duping transparently, keeping the coded values in-table. -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
On Fri, Oct 13, 2017 at 12:43 PM, Seamus Abshere <seamus@abshere.net> wrote:
> > On Fri, Oct 13, 2017 at 8:49 AM, Seamus Abshere wrote:
> >> Theoretically / blue sky, could there be a table or column type that
> >> transparently handles "shared strings" like this, reducing size on disk
> >> at the cost of lookup overhead for all queries?
> >> (I guess maybe it's like TOAST, but content-hashed and de-duped and not
> >> only for large objects?)
On Fri, Oct 13, 2017, at 01:29 PM, Melvin Davidson wrote:
> What was described is exactly what relations and Foreign Keys are for.
hi Melvin, appreciate the reminder. Our issue is that we have 300+
columns and frequently include them in the SELECT or WHERE clauses... so
traditional normalization would involve hundreds of joins.
That's why I ask about a new table or column type that handles basic
translation and de-duping transparently, keeping the coded values
in-table.
>I ask about a new table or column type that handles basic translation
AFAIK, there is no such thing currently available.Your initial post indicated you were working with spreadsheets and were
looking to translate to PostgreSQL database. There is no short cut to normalizing, but the time you spend doing so in initial
design will well be worthwhile once it is implemented.
--
Melvin Davidson
I reserve the right to fantasize. Whether or not you
wish to share my fantasy is entirely up to you.
I reserve the right to fantasize. Whether or not you
wish to share my fantasy is entirely up to you.

On Fri, Oct 13, 2017 at 12:52 PM, Melvin Davidson <melvin6925@gmail.com> wrote:
On Fri, Oct 13, 2017 at 12:43 PM, Seamus Abshere <seamus@abshere.net> wrote:> > On Fri, Oct 13, 2017 at 8:49 AM, Seamus Abshere wrote:
> >> Theoretically / blue sky, could there be a table or column type that
> >> transparently handles "shared strings" like this, reducing size on disk
> >> at the cost of lookup overhead for all queries?
> >> (I guess maybe it's like TOAST, but content-hashed and de-duped and not
> >> only for large objects?)
On Fri, Oct 13, 2017, at 01:29 PM, Melvin Davidson wrote:
> What was described is exactly what relations and Foreign Keys are for.
hi Melvin, appreciate the reminder. Our issue is that we have 300+
columns and frequently include them in the SELECT or WHERE clauses... so
traditional normalization would involve hundreds of joins.
That's why I ask about a new table or column type that handles basic
translation and de-duping transparently, keeping the coded values
in-table.
>I ask about a new table or column type that handles basic translationAFAIK, there is no such thing currently available.Your initial post indicated you were working with spreadsheets and werelooking to translate to PostgreSQL database. There is no short cut to normalizing, but the time you spend doing so in initialdesign will well be worthwhile once it is implemented.
--Melvin Davidson
I reserve the right to fantasize. Whether or not you
wish to share my fantasy is entirely up to you.
Seamus,
Just a thought. As I mentioned previously, there is no shortcut to optimizing your database. However, you can do it in increments.
First, create all your foreign key / repetative data tables.
First, create all your foreign key / repetative data tables.
Next, add additional FK columns to you current tables to reference the fk / repetative data tables.
Modify your application / queries to utilize the new columns.
Do extensive testing to make sure your modifications work properly.
VERY IMPORTANT: Before the next step, make a backup of the existing database and verify you have a good copy.
Finally, drop all the old repetative data columns.
--
VERY IMPORTANT: Before the next step, make a backup of the existing database and verify you have a good copy.
Finally, drop all the old repetative data columns.
--
Melvin Davidson
I reserve the right to fantasize. Whether or not you
wish to share my fantasy is entirely up to you.
I reserve the right to fantasize. Whether or not you
wish to share my fantasy is entirely up to you.

On 2017-10-13 12:49:21 -0300, Seamus Abshere wrote: > In the spreadsheet world, there is this concept of "shared strings," a > simple way of compressing spreadsheets when the data is duplicated in > many cells. > > In my database, I have a table with >200 million rows and >300 columns > (all the households in the United States). For clarity of development > and debugging, I have not made any effort to normalize its contents, so > millions of rows have, for example, "SINGLE FAMILY RESIDENCE / > TOWNHOUSE" (yes, that whole string!) instead of some code representing > it. > > Theoretically / blue sky, could there be a table or column type that > transparently handles "shared strings" like this, reducing size on disk > at the cost of lookup overhead for all queries? Theoretically it's certainly possible and I think some column-oriented databases store data that way. > (I guess maybe it's like TOAST, but content-hashed and de-duped and not > only for large objects?) Yes, but if you want to autmatically delete entries which are no longer needed you need to keep track of that. So either a reference count or an index lookup on the parent table. This is starting to look a lot like a foreign key - just hidden from the user. Performance would probably be similar, too. We have done something similar (although for different reasons). We ended up doing the "join" in the application. For most purposes we don't need the descriptive strings and when we need them we can do a dictionary lookup just before sending them to the client (of course the dictionary has to be read from the database, too, but it doesn't change that often, so it can be cached). And from a software maintainability POV I think a dictionary lookup in Perl is a lot nicer than 50 joins (or 300 in your case). hp -- _ | Peter J. Holzer | we build much bigger, better disasters now |_|_) | | because we have much more sophisticated | | | hjp@hjp.at | management tools. __/ | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>
> On Fri, Oct 13, 2017 at 8:49 AM, Seamus Abshere wrote
> > Theoretically / blue sky, could there be a table or column type that
> > transparently handles "shared strings" like this, reducing size on disk
> > at the cost of lookup overhead for all queries?
> > (I guess maybe it's like TOAST, but content-hashed and de-duped and not
> > only for large objects?)
On Fri, Oct 13, 2017, at 01:12 PM, David G. Johnston wrote:
> Row-independence is baked into PostgreSQL pretty deeply...
Could you say more about that?
Not intelligibly...basically as far as PostgreSQL is concerned all the data to reconstruct a row from a given table is present in that table. From a practical perspective the "TOAST table" for a table IS part of the main table since it has no practical independent use.
As an aside I was thinking along the lines of an actual compression routine which is what a spreadsheet file is able to do since a spreadsheet contains the data from every row and column in a single file and is able to compress the entire file by finding commonalities across rows and columns. A database generally cannot do that.
As for "transparent lookup tables for text columns"...I suppose one could implement a "system-managed-enum" type with many of the same properties of an actual enum but avoiding many of its problems by not exposing the enum-ness to the user and instead just exposing the text labels...I suspect faced with prospect of doing something that complex most users would just setup a FK relationship.
What about the comparison to TOAST, which stores values off-table?
TOAST solves a technical problem related to the fact that records "on the table" have a very small size limitation (kb) while stored values can be at least as large as a GB. TOAST does involved compression but the input to the compression algorithm is a single cell (row and column) in a table. As noted above I consider the TOAST table and main table to be a single logical table.
Like I said the enum type has similar properties to what you want - but Melvin is right that using it requires careful consideration of how your data might change in the future.
David J.
Seamus Abshere schrieb am 13.10.2017 um 18:43: >>> On Fri, Oct 13, 2017 at 8:49 AM, Seamus Abshere wrote: >>>> Theoretically / blue sky, could there be a table or column type that >>>> transparently handles "shared strings" like this, reducing size on disk >>>> at the cost of lookup overhead for all queries? >>>> (I guess maybe it's like TOAST, but content-hashed and de-duped and not >>>> only for large objects?) > > On Fri, Oct 13, 2017, at 01:29 PM, Melvin Davidson wrote: >> What was described is exactly what relations and Foreign Keys are for. > > hi Melvin, appreciate the reminder. Our issue is that we have 300+ > columns and frequently include them in the SELECT or WHERE clauses... so > traditional normalization would involve hundreds of joins. > > That's why I ask about a new table or column type that handles basic > translation and de-duping transparently, keeping the coded values > in-table. As those are all strings: why not put them into a hstore (or jsonb) column? Then all of them could compressed (TOASTed) as a single value instead of 300 independent values. -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
On Fri, Oct 13, 2017, at 03:16 PM, David G. Johnston wrote: > implement a "system-managed-enum" type with many of the same properties [...] > TOAST does involved compression but the input to > the compression algorithm is a single cell (row and column) in a table. > As noted above I consider the TOAST table and main table to be a single > logical table. See this sounds like _exactly_ what I want. Except with a content hash instead of an id. Seems to me like all of the machinery that allows you to look things up by TOASTed columns and subsequently return TOASTed values as if they resided in the same physical table is what is needed. -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general