Thread: Postgresql as a dictionary coder backend?

Postgresql as a dictionary coder backend?

From

Attila Nagy

Date:

23 January 2011, 10:33:25

  Hello,

I'm looking for a database backend for a dictionary coder project. It
would have three major tasks:
- take a text corpus, get their words and substitute each word by a 64
bit integer (the word:integer is always constant) and store the result
(encoding)
- take the previous result and substitute the integers with words (decoding)
- the words should be reference counted, so if a word can be no longer
found in any of the encoded messages, delete it (and optionally free
it's integer ID, but 64 bit is believed to be enough for a long time,
although having smaller IDs result smaller encoded files). This could be
achieved by informing the database of the words of a deleted message, so
it could decrement those refcounts and delete the records if needed.

I can easily do this with any RDBMS, with a table of three columns: auto
incremented ID, word and refcount, with a unique index on word.
The challenge could be:
- that it should scale to several TBs of size and several (hundred)
billion of records. One scenario would be to store about 40 TBs of words
and the average word length would be about 50-60 bytes (that's about
800*10^9 records). It should work well both for inserting and searching
(encoding and decoding) words.
- I need atomicity and durability, but having these on a word (record)
level takes too much IOPS and have no use, so it would be good to have
an interface for inserting about 1000-500000 words in one call, assign a
unique ID to each unique words and store them (if the word has had
already an ID, increment its refcount) and give back the IDs for each
words. This transaction could be committed as one, so the transactions
could be big, sparing IOPS.
- I need concurrency, so when the above happens from two sources at the
same time, the same word in the two transactions must get the same ID

Is postgresql a good choice for doing this and if yes, what would be the
optimal (for both time and space efficiency at encoding and decoding)
use case?

Thanks,

Re: Postgresql as a dictionary coder backend?

From

Ben Chobot

Date:

23 January 2011, 22:19:21

On Jan 23, 2011, at 3:29 AM, Attila Nagy wrote:

> Hello,
>
> I'm looking for a database backend for a dictionary coder project. It would have three major tasks:
> - take a text corpus, get their words and substitute each word by a 64 bit integer (the word:integer is always
constant)and store the result (encoding) 
> - take the previous result and substitute the integers with words (decoding)
> - the words should be reference counted, so if a word can be no longer found in any of the encoded messages, delete
it(and optionally free it's integer ID, but 64 bit is believed to be enough for a long time, although having smaller
IDsresult smaller encoded files). This could be achieved by informing the database of the words of a deleted message,
soit could decrement those refcounts and delete the records if needed. 

Just curious, is this your end goal, or would some existing text search engine (tsearch2, lucerne, etc) fit the bill?

> I can easily do this with any RDBMS, with a table of three columns: auto incremented ID, word and refcount, with a
uniqueindex on word. 
> The challenge could be:
> - that it should scale to several TBs of size and several (hundred) billion of records. One scenario would be to
storeabout 40 TBs of words and the average word length would be about 50-60 bytes (that's about 800*10^9 records). It
shouldwork well both for inserting and searching (encoding and decoding) words. 

Yes, Postgres can do this, but you're probably going to want to look into partitioning your tables. For instance,
insteadof a single word table, you can make word tables that store 100M words each - or if  100M ids per table doesn't
soundright, pick some other number. You can still access the table as if it's a single table, and under the covers PG
willuse constraint exclusion to know which table partition to use. 

> - I need atomicity and durability, but having these on a word (record) level takes too much IOPS and have no use, so
itwould be good to have an interface for inserting about 1000-500000 words in one call, assign a unique ID to each
uniquewords and store them (if the word has had already an ID, increment its refcount) and give back the IDs for each
words.This transaction could be committed as one, so the transactions could be big, sparing IOPS. 

I must be something something because it's unclear why standard transactions don't give you this? Except rolling back
IDsin the case of an abort, of course. Why do you need that? 

> - I need concurrency, so when the above happens from two sources at the same time, the same word in the two
transactionsmust get the same ID 

You can't have this and the previous constraint, at least not with PG. I don't know of any system that will give you
durablewords but still detect when the same word is being inserted in parallel transactions, AND save you the iops of
word-by-wordwrites. 

Unless, of course, you're ok with one transaction failing due to trying to add a word that wasn't there at the start?
Ifyou did something with a unique constraint on the word, you might be able to achieve this.

Re: Postgresql as a dictionary coder backend?

From

Fredric Fredricson

Date:

24 January 2011, 00:27:19

On 01/23/2011 12:29 PM, Attila Nagy wrote:
>  Hello,
>
> I'm looking for a database backend for a dictionary coder project. It
> would have three major tasks:
> - take a text corpus, get their words and substitute each word by a 64
> bit integer (the word:integer is always constant) and store the result
> (encoding)
> - take the previous result and substitute the integers with words
> (decoding)
> - the words should be reference counted, so if a word can be no longer
> found in any of the encoded messages, delete it (and optionally free
> it's integer ID, but 64 bit is believed to be enough for a long time,
> although having smaller IDs result smaller encoded files). This could
> be achieved by informing the database of the words of a deleted
> message, so it could decrement those refcounts and delete the records
> if needed.
Why do you need 64 bits? An language contains somewhere around 1.000.000
words so 32 bits should be enough for more than 2000 languages. I read
somewhere that the Oxford Dictionary contains 150.000 words with a total
of 600.000 word forms.
Anyways, 64 bit seem to be a bit overkill, or am I missing something.
>
> I can easily do this with any RDBMS, with a table of three columns:
> auto incremented ID, word and refcount, with a unique index on word.
> The challenge could be:
> - that it should scale to several TBs of size and several (hundred)
> billion of records. One scenario would be to store about 40 TBs of
> words and the average word length would be about 50-60 bytes (that's
> about 800*10^9 records). It should work well both for inserting and
> searching (encoding and decoding) words.
50-60 bytes! Oh, so we are not talking about natural language here.
Sorry, I just assumed that.
Still, I think performance will be a big issue here. I have never tried
postgresql on anything faster than a fast PC but in my humble experience
an insert will take at least one ms. With this speed 800*10^9 records
would take 25 years to insert.
I think you have to think bigger than a single server (ok, that was
stating the obvious).
> - I need atomicity and durability, but having these on a word (record)
> level takes too much IOPS and have no use, so it would be good to have
> an interface for inserting about 1000-500000 words in one call, assign
> a unique ID to each unique words and store them (if the word has had
> already an ID, increment its refcount) and give back the IDs for each
> words. This transaction could be committed as one, so the transactions
> could be big, sparing IOPS.
> - I need concurrency, so when the above happens from two sources at
> the same time, the same word in the two transactions must get the same ID
Unless the ID is some kind of hash you will have to serialize inserts of
new words.
>
> Is postgresql a good choice for doing this and if yes, what would be
> the optimal (for both time and space efficiency at encoding and
> decoding) use case?
I might be wrong but this kind of project should probably not rely on a
"standard" RDBMS. The data structure itself does not seem to complex and
the performance requirements are quite demanding.

But then, I'm no expert so don't take my word for it.
/Fredric
>
> Thanks,
>
>

Attachment

Fredric_Fredricson.vcf

Re: Postgresql as a dictionary coder backend?

From

Attila Nagy

Date:

24 January 2011, 16:26:43

On 01/24/2011 03:19 AM, Ben Chobot wrote:
> On Jan 23, 2011, at 3:29 AM, Attila Nagy wrote:
>
>> Hello,
>>
>> I'm looking for a database backend for a dictionary coder project. It would have three major tasks:
>> - take a text corpus, get their words and substitute each word by a 64 bit integer (the word:integer is always
constant)and store the result (encoding) 
>> - take the previous result and substitute the integers with words (decoding)
>> - the words should be reference counted, so if a word can be no longer found in any of the encoded messages, delete
it(and optionally free it's integer ID, but 64 bit is believed to be enough for a long time, although having smaller
IDsresult smaller encoded files). This could be achieved by informing the database of the words of a deleted message,
soit could decrement those refcounts and delete the records if needed. 
> Just curious, is this your end goal, or would some existing text search engine (tsearch2, lucerne, etc) fit the bill?
Yes, this isn't a text search project. Maybe data deduplication is
closer. Of course if a text search stuff fitst this, it's welcome. :)
>
>> I can easily do this with any RDBMS, with a table of three columns: auto incremented ID, word and refcount, with a
uniqueindex on word. 
>> The challenge could be:
>> - that it should scale to several TBs of size and several (hundred) billion of records. One scenario would be to
storeabout 40 TBs of words and the average word length would be about 50-60 bytes (that's about 800*10^9 records). It
shouldwork well both for inserting and searching (encoding and decoding) words. 
> Yes, Postgres can do this, but you're probably going to want to look into partitioning your tables. For instance,
insteadof a single word table, you can make word tables that store 100M words each - or if  100M ids per table doesn't
soundright, pick some other number. You can still access the table as if it's a single table, and under the covers PG
willuse constraint exclusion to know which table partition to use. 
The ID space should be continuous and unique across the tables, so a
traditional table partitioning across a key should work, but
partitioning as in sharding seems to be harder...
>> - I need atomicity and durability, but having these on a word (record) level takes too much IOPS and have no use, so
itwould be good to have an interface for inserting about 1000-500000 words in one call, assign a unique ID to each
uniquewords and store them (if the word has had already an ID, increment its refcount) and give back the IDs for each
words.This transaction could be committed as one, so the transactions could be big, sparing IOPS. 
> I must be something something because it's unclear why standard transactions don't give you this? Except rolling back
IDsin the case of an abort, of course. Why do you need that? 
Think of two documents, each of them arrive in the same time to the
encoder. If the two documents both have the word "boo", it must be given
the same ID. Giving unique constraint error for some rows of a batch
insert severely hits performance.

>> - I need concurrency, so when the above happens from two sources at the same time, the same word in the two
transactionsmust get the same ID 
> You can't have this and the previous constraint, at least not with PG. I don't know of any system that will give you
durablewords but still detect when the same word is being inserted in parallel transactions, AND save you the iops of
word-by-wordwrites. 
>
> Unless, of course, you're ok with one transaction failing due to trying to add a word that wasn't there at the start?
Ifyou did something with a unique constraint on the word, you might be able to achieve this. 
I agree, this seems to be hard in a concurrent manner in a database
(maybe with a clever caching stored procedure)..

Re: Postgresql as a dictionary coder backend?

From

Attila Nagy

Date:

24 January 2011, 16:34:39

  On 01/24/2011 05:27 AM, Fredric Fredricson wrote:
>>
>> I can easily do this with any RDBMS, with a table of three columns:
>> auto incremented ID, word and refcount, with a unique index on word.
>> The challenge could be:
>> - that it should scale to several TBs of size and several (hundred)
>> billion of records. One scenario would be to store about 40 TBs of
>> words and the average word length would be about 50-60 bytes (that's
>> about 800*10^9 records). It should work well both for inserting and
>> searching (encoding and decoding) words.
> 50-60 bytes! Oh, so we are not talking about natural language here.
> Sorry, I just assumed that.
Yes, sorry.
> Still, I think performance will be a big issue here. I have never
> tried postgresql on anything faster than a fast PC but in my humble
> experience an insert will take at least one ms. With this speed
> 800*10^9 records would take 25 years to insert.
> I think you have to think bigger than a single server (ok, that was
> stating the obvious).
Well, to be honest, that would be for a single server. And I would like
to have more than a single server. :)
One ms is 1000 TPS, if you can't utilize the DB more with multiple
threads. I haven't really got a fast PC yet, but I could easily achieve
12k TPS with some multicore stuff and BBWC RAID controllers.
It's only 771 years then. :)
>> - I need atomicity and durability, but having these on a word
>> (record) level takes too much IOPS and have no use, so it would be
>> good to have an interface for inserting about 1000-500000 words in
>> one call, assign a unique ID to each unique words and store them (if
>> the word has had already an ID, increment its refcount) and give back
>> the IDs for each words. This transaction could be committed as one,
>> so the transactions could be big, sparing IOPS.
>> - I need concurrency, so when the above happens from two sources at
>> the same time, the same word in the two transactions must get the
>> same ID
> Unless the ID is some kind of hash you will have to serialize inserts
> of new words.
I start to think this too, if the unique constraint is there...
>>
>> Is postgresql a good choice for doing this and if yes, what would be
>> the optimal (for both time and space efficiency at encoding and
>> decoding) use case?
> I might be wrong but this kind of project should probably not rely on
> a "standard" RDBMS. The data structure itself does not seem to complex
> and the performance requirements are quite demanding.
I've done some experiments with BerkeleyDB and Tokyo/Kyoto Cabinet, but
I was curious about the opinion of the RDBMS masters.

Thanks for sharing.

Re: Postgresql as a dictionary coder backend?

From

Cédric Villemain

Date:

25 January 2011, 18:14:09

2011/1/23 Attila Nagy <bra@fsn.hu>:
>  Hello,
>
> I'm looking for a database backend for a dictionary coder project. It would
> have three major tasks:
> - take a text corpus, get their words and substitute each word by a 64 bit
> integer (the word:integer is always constant) and store the result
> (encoding)

ok. PostgreSQL allow to do that easily.

> - take the previous result and substitute the integers with words (decoding)

idem.

> - the words should be reference counted, so if a word can be no longer found
> in any of the encoded messages, delete it (and optionally free it's integer
> ID, but 64 bit is believed to be enough for a long time, although having
> smaller IDs result smaller encoded files). This could be achieved by
> informing the database of the words of a deleted message, so it could
> decrement those refcounts and delete the records if needed.

Yes, like what despez do :
http://www.depesz.com/index.php/2009/07/10/getting-list-of-unique-elements/

>
> I can easily do this with any RDBMS, with a table of three columns: auto
> incremented ID, word and refcount, with a unique index on word.
> The challenge could be:
> - that it should scale to several TBs of size and several (hundred) billion
> of records. One scenario would be to store about 40 TBs of words and the
> average word length would be about 50-60 bytes (that's about 800*10^9
> records). It should work well both for inserting and searching (encoding and
> decoding) words.

I strongly suggest you to have a look at intarray contrib (it is
provided with PostgreSQL.

> - I need atomicity and durability, but having these on a word (record) level
> takes too much IOPS and have no use, so it would be good to have an
> interface for inserting about 1000-500000 words in one call, assign a unique
> ID to each unique words and store them (if the word has had already an ID,
> increment its refcount) and give back the IDs for each words. This
> transaction could be committed as one, so the transactions could be big,
> sparing IOPS.

Array allow a very good compression of the data per row. (still it is
not a RDBMS way to use array for that, but it is good for
performances)

> - I need concurrency, so when the above happens from two sources at the same
> time, the same word in the two transactions must get the same ID

one transaction will finish before the other to allow that. (but they
can start at the same time)

>
> Is postgresql a good choice for doing this and if yes, what would be the
> optimal (for both time and space efficiency at encoding and decoding) use
> case?

PostgreSQL should work for that, yes. You'll have to compensate the
size with good hardware and good SQL (and probably some optimization
like using arrays)

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support