Home > mailing lists

Re: Berkeley DB... - Mailing list pgsql-hackers

From	Michael A. Olson
Subject	Re: Berkeley DB...
Date	May 22, 2000 00:12:28
Msg-id	200005220411.VAA84795@triplerock.olsons.net Whole thread Raw
In response to	Re: Berkeley DB... (Chris Bitmead <chrisb@nimrod.itg.telstra.com.au>)
List	pgsql-hackers

Tree view

I'm responding in a single message to several questions prompted by my
message of this morning.

Chris Bitmead asked:

> > A single record or a single key may be up to 4GB in size.
> 
> That's certainly nice. But if you don't access a BIG column, you have to
> retrieve the whole record?

You can do partial reads, but only at the expense of complicating the
code in the PostgreSQL server.  We provide some reasonable interfaces
for fetching only part of large records.  However, you need to know
whether you should call them or not.  As a result, you'd have to
record in the system catalog somewhere that a particular table contained
big tuples, and you'd have to write your fetch code to read only the
byte range you care about.

That would complicate the server, since you'd want to do simple
fetches for the simple case, too.

Vadim Mikheev made some good points on space reuse.  Unless a page is
empty space on the page for keys in the right range.  For append-only 
workloads (like increasing heap tids), that's not what you want.

Vadim then asked:

> You can't merge two 49% empty pages in one. So, how to reuse
> this 49%? How will we able to implement feature that good
> databases have: one can specify while table creation -
> "insert new tuples on pages which N% empty"?

We already recognize the special case of in-order insersions (as in
the case of increasing heap tids).  We split pages so that the right
child is nearly empty and left is nearly full.  That gives you close
to 100% space utilization at build time.  Adding a fill factor to
the initialization code would be very easy.

> And, while we are on heap subject - using index (RECNO) for heap
> means that all our secondary-index scans will performe TWO
> index scans - first, to find recno in secondary-index, and
> second, to find heap tuple using recno (now indices give us
> TID, which is physical address).

We're not going to resolve this question without building both
systems and measuring their performance.  The non-leaf levels of
btrees are pretty much always in the cache because they're hot.
Whether your fetch-a-tuple code path is shorter than my fetch-
a-tuple code path is undecided.

Frankly, based on my experience with Berkeley DB, I'd bet on mine.
I can do 2300 tuple fetches per CPU per second, with linear scale-
up to at least four processors (that's what we had on the box we
used).  That's 9200 fetches a second.  Performance isn't going
to be the deciding issue.

(The test system was a mid-range Solaris box -- reasonable, but not
extravagant, clock speed, memory, and disk.)

On testing failure at critical points in the code, Vadim wrote:

> Oh, testing of this case is very easy - I'll just stop backend
> using gdb in critical points and will turn power off -:))
> I've run 2-3 backends under gdb to catch some concurrency-related
> bug in buffer manager - this technique works very well -:)

For limited concurrency and fairly simple problems, that technique
works well.  You should plan to test PostgreSQL with hundreds of
concurrent backends with a complex workload for days in order to
convince people that the system works correctly.  This is what the
commercial vendors (including Sleepycat!) do.  Your testing
strategy should include randomly killing the system to demonstrate
that you recover correctly.

I'm only warning you to be careful and to take this seriously.  It's
very hard to do the kind of testing you should.  The recovery system
is generally the most poorly-exercised part of the system, but it's
the one piece that absolutely has to work flawlessly.  It only runs
after your system has crashed, and your customer is already angry.

Finally, Vadim makes the point that switching to Berkeley DB forces
you to stop working on code you understand, and to pick up a new
package altogether.  Worse, you'd need to do some pretty serious
engineering to get multi-version concurrency control into Berkeley
DB before you could use it.

This is a pretty compelling argument to me.  I've been busy
explaining how you *could* make the switch, but the real question
is whether you *should*.  I can answer all of Vadim's questions
reasonably.  Frankly, though, if I were in charge of the engineering
effort for PostgreSQL, I'd be disinclined to use Berkeley DB on the
strength of the interface changes it requires and the effort that
would be required to implement MVCC.

I say this in the spirit of complete disclosure -- we'd like to
see you use our software, but you need to make a business decision
here.  If you hadn't already done MVCC, I'd be arguing the other
side, but you have.

Regards,                mike

pgsql-hackers by date:

From: Tom Lane
Date: 22 May 2000, 00:09:25
Subject: Re: Last call for comments: fmgr rewrite [LONG]

From: Bruce Momjian
Date: 22 May 2000, 00:19:24
Subject: Re: Last call for comments: fmgr rewrite [LONG]

Re: Berkeley DB... - Mailing list pgsql-hackers

Previous

Next