Re: count(*) and bad design was: Experiences with extensibility - Mailing list pgsql-general
From | Zoltan Boszormenyi |
---|---|
Subject | Re: count(*) and bad design was: Experiences with extensibility |
Date | |
Msg-id | 478520A3.8010407@cybertec.at Whole thread Raw |
In response to | Re: count(*) and bad design was: Experiences with extensibility (Ivan Sergio Borgonovo <mail@webthatworks.it>) |
Responses |
Re: count(*) and bad design was: Experiences with
extensibility
|
List | pgsql-general |
Ivan Sergio Borgonovo írta: > On Wed, 9 Jan 2008 13:04:39 +0100 > "Harald Armin Massa" <haraldarminmassa@gmail.com> wrote: > > >> Ivan, >> > > >>> Please forgive my naiveness in this field but what does it mean an >>> "exact count" and what other DB means with "an exact count" and >>> how other DB deal with it? >>> > > >> PostgreSQL will give you an exact count of the contents of the >> database as it is in the moment you begin your count. (i.e. the >> transaction starts) >> > > >> BUT as the table is not locked, in parallel somebody can bulkload >> MANY items into the database, so at the moment (start of your >> transaction) >> + 1msec your count may be invalid allready. >> > > That's reasonable. What other DB do and what is the SQL standard > requirement for count(*)? > > >>> I'd expect it perform as good or better than other DB since now >>> the bottleneck should be how efficiently it can filter records... >>> but still a count(*) with a where clause will incur in the same >>> problem of what "exact" means. >>> >> I know of 3 concepts to answer count() faster then PostreSQL: >> >> 1) just lie, present an estimate >> 2) do not have MVCC >> 3) store "record deleted info" in index, so you can answer count() >> with only scanning the index >> > > Sorry if I insist but I think this is a beaten selling point against > postgresql. > One of the most recurring argument about why someone else db is > better than pg is count is slow. > > Who lies? > If it is possible to do otherwise to have count run faster what are > the trade off that make it unreasonable to implement it in pg? > The decision to use MVCC in PostgreSQL makes the point moot. What your transaction cannot see (e.g. another transaction inserted a new record but didn't COMMITted yet) cannot be counted as visible. It's not only a theoretical but also practical. Compare that to the MaxDB way. At the very lowest level MaxDB keeps the records in a balanced b-tree based on its primary key. Look at this URL: http://blog.ulf-wendel.de/?p=76 MaxDB keeps how many records are in the tree at all times which is handy if your database is close to idle. Consequences: - very fast COUNT(*) when idle - records unCOMMITted by other transactions can be seen by COUNT(*) if your transaction is in "read committed" isolation level but not the actual records - COUNT(*) waits for all other transactions that modified the table in question to finish if your transaction is in "repeatable read" or "serializable" isolation levels. Consider that transactions can take a loooong time to finish if they process many things. This way your SELECT COUNT(*) doesn't respond instantly but doesn't slow down your server either. But the end user perception is the same: COUNT(*) is slow! In PostgreSQL, COUNT(*) responds closely at the same speed regardless of other transactions. Which way do you prefer? > This is not very useful question but 3) imply that select scan the > index return the rows and just later check if they are still there. > Is it? > And since indexes aren't updated "on the fly" you may get back a > larger number than what is actually the real value. > > Let me consider an everyday use where count() looks as the most > obvious solution: paging. > > I search trough a table and I need to know which is the last page. > Concurrently someone is deleting a ton of records. > No matter if count() is fast or not when I output the pager it will > be "wrong". But still I'll need an estimate of the number of pages, > it is not fun if that estimate is wrong *and* slow. > And once you add the where clauses there is no way to cache the count. > Is there a way to count based on indexes without taking into account > deleted rows so to "count" faster? > > I can make the search faster using indexes as Dann Corbit suggested, > but as you imply that won't make count as fast as the > "concurrence"[1] that lie or don't use MVCC or store deleted info in > indexes. > > SELECT reltuples FROM pg_class WHERE relname = <table_name>; > > doesn't apply since you can't add "conditions". > > Please be patient. I hear this over and over and over. Postgresql is > faulty because it can't count fast. > And the only reply I've seen are: it is bad design to use count, > you're a dumb ass. I admit I may be a dumb ass, but it is hard to > sell Postgres if I continue to be a dumb ass ;) > > - What count(*) really does? > Now I understood that count(*) return the # of rows as it sees them > at the moment it was invoked. That should be true for other DB as > well. That means that unless other DB lock the table they can't take > into account records that are going to be deleted once the count has > been issued. > > - When count can be avoided? > Well since even other DB may report the "wrong" number, this makes > count() look less a Sacre Graal. But still if you need an estimate, > wouldn't it be better to have it quickly? > How does postgresql compare to other DB when you run: > select count(*) from table where conditions > once you use indexes? > If such kind of query will have anyway to scan the results to see if > they are still there since info about deleted records aren't stored > in indexes, is there a way to ignore this and just have a faster > estimate? > I still can't see why it is bad design to use count(). > > - When count() can't be avoided? > All the situation where you may really need count() I think you also > need to lock the table but well I'd be curious to see an example > where you need count() > > Still can somebody make an example of bad design and one where > count() couldn't be avoided if any? > > Consider that while it makes few sense to rely on "wrong" numbers in > a "business" environment where data integrity/coherence makes *a lot* > of sense it is not so clear in a CMS world where most of those > critics come from. > > I know that the arguments to promote postgres in the "business" world > where DB2, Oracle and MS SQL play (?) may be different and count() > may lose its importance in that context and you could say that other > advantages plenty pay off the "slowness" of an operation that in such > a context is rare. > > > thanks > > > [1] or does postgres perform as the concurrence once you add where > clauses? > > > -- ---------------------------------- Zoltán Böszörményi Cybertec Schönig & Schönig GmbH http://www.postgresql.at/
pgsql-general by date: