Re: [GENERAL] Re: Is PostgreSQL ready for mission critical applications? - Mailing list pgsql-general
From | Mike Mascari
Subject | Re: [GENERAL] Re: Is PostgreSQL ready for mission critical applications?
Date |
Msg-id | 386A5BBD.71375361@mascari.com
In response to | RE: [GENERAL] Re: Is PostgreSQL ready for mission critical applications? ("Barnes" <aardvark@ibm.net>)
Responses | Re: [GENERAL] Re: Is PostgreSQL ready for mission critical applications?
List | pgsql-general
Barnes wrote:
> It would be helpful to me to hear about successful and stable
> implementations as well. If some of you who are using PostgreSQL
> successfully could comment on your experiences, I think it would shed
> some worthwhile light on its capabilities. I'm considering using it for
> a mission critical project, and I would like to know what I am getting
> into. Thank you.
>
> David Barnes

We've used it successfully in a production environment (24 x 7) for over a year now. Simply reading the mailing list will greatly improve your chances of success. The problems with PostgreSQL can be avoided if you know, in advance, what to avoid. But most people don't. Here's my list of things which can get you into trouble:

0. Running with fsync on - Without fsync(), modified records can be sitting in kernel buffers, not yet written to disk, at the moment of an operating system crash. Therefore, PostgreSQL's default mode is to run with fsync() on. This slows down the database by (quite literally) several orders of magnitude. We've run with fsync off (-o -F) without a problem. Dump/reload of large databases with fsync() on really tests one's pain threshold. If you trust your OS, run with it off.

1. Making use of oids - Problems with dumping/restoring oids make this development path dangerous. Most people use the SERIAL data type to generate primary keys. However, SERIAL itself has some peculiarities, since it just auto-creates a sequence. Dropping the associated table doesn't drop the sequence (IIRC), so scripted or automated schema creation may not be obvious. I prefer to manually use a sequence and an int4 type for primary keys. In fact, you could use the same sequence for all of your primary keys, if you aren't exposing the value to the user in any meaningful way, don't plan to hit the 4.2 billion limit of int4 soon, and don't care about gaps... (although a purist would argue, and I agree, that IF you are going to use generated keys THEN the key should have no more meaning than that it refers to a record).
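To illustrate what I mean (the table and sequence names here are made up for the example), a hand-rolled sequence-based key looks something like this:

-- example only: names are made up
CREATE SEQUENCE customer_id_seq;

CREATE TABLE customers (
    customer_id  int4 DEFAULT nextval('customer_id_seq') NOT NULL PRIMARY KEY,
    name         varchar(40) NOT NULL
);

-- the key is filled in from the sequence automatically
INSERT INTO customers (name) VALUES ('Acme');

If the client needs to know the key value before inserting, it can SELECT nextval('customer_id_seq') itself and supply the value explicitly.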
2. Using views created with large queries - Views use the rewrite system and rules to rewrite a query against the view so that it properly fetches data from the underlying tables. Because there is currently a limit on the size of a single database record (8192 bytes), the queries associated with views can only be so big. In addition, you can get into trouble if views are built on top of user-defined functions, which is a common thing to do. If you drop/recreate the underlying function, then the view needs to be dropped and recreated as well. I've also had trouble with dump/reload of views in the past, and have always kept my view definitions in a separate views.sql script, just in case...

3. Using non-standard types - Because of problems with data comparisons, type coercion and spaces, we avoided types such as bpchar, char, and even text. We avoided text since ODBC clients could not determine maximum field width. We also avoided all of the non-4-byte integer types, such as int2. This is because the default type coercion code in the past has had trouble being smart enough to use indexes when given a SELECT such as:

CREATE TABLE y (x text, z int2);

SELECT x FROM y WHERE z = 3;

because the 3 would be coerced to an int4 and, if z was an int2, would result in a sequential scan, whereas:

SELECT x FROM y WHERE z = '3';

would use the index, since the literal is initially parsed as a string and coerced properly at a later point. I think much of this has been fixed, but nevertheless... In addition, our varchar() types are pretty much kept under the 255 limit, since some ODBC clients have problems with varchar() types greater than 255. We only use: int4, varchar, datetime, and float8. On rare occasion, we'll use text for free-form information, but we NEVER index it. Although it's VERY tempting (and PostgreSQL itself uses them), we avoid arrays.

4. Be careful about user-defined functions/triggers - PostgreSQL keeps track of everything by its oid, not by name (which would obviously be too slow). But, unfortunately, it does not yet support modifying a function in place, which would allow the function to retain its original oid (or perform a cascading update - it will be nice when RI is integrated into the system catalogue!). As a result, odd things can happen if you drop and recreate a function. For example, you could have a trigger call a procedural-language function which, in turn, selects from a view, in which one of the attributes is the result of a function. If you drop/recreate that function, things go a bit weird and usually result in an error such as "function not in cache".

5. Using DDL statements in transactions - PostgreSQL has trouble rolling back aborted transactions that contain DDL statements. As a result, you might find yourself having to delete a filesystem file by hand, because, even though a CREATE TABLE might have been rolled back as far as the system catalogue is concerned, the underlying file might still exist. Or worse, rollback of an index DROP/CREATE in a transaction yields erroneous results.

6. Using indexes on large fields - Apparently the code requires 3 tuples per page (or something like that) for the index to function properly. This can include plpgsql source, so be careful. We never index on anything larger than 255, but I believe around 2.8K is the limit before tickling bugs...

7. Using INSERTs instead of COPY - Even when you have fsync() off and are running INSERT statements in a transaction, processing individual INSERT statements by the thousands is still several orders of magnitude slower than COPY. We have large mainframe datasets which we import nightly - we first convert them to data appropriate for COPY and then COPY them in, instead of INSERTing record by record. The problem with COPY is that it runs as user postgres, so the data files need to be readable by user postgres.

8. Not running VACUUM - PostgreSQL won't use indexes, or won't optimize correctly, unless the record count and dispersion estimates are up-to-date. People have reported problems with running vacuum while under heavy load. We haven't seen it, but we run vacuum each night at 4:05 a.m. However, if you perform a LARGE number of INSERTs/UPDATEs, it is better for you to do the following:

DROP INDEX index_on_heavily_used_table;
VACUUM ANALYZE;
CREATE INDEX index_on_heavily_used_table ON heavily_used_table (...);

because VACUUM will sit there and, row by row, essentially "defragment" your indexes, which can take damn near forever for any number of updates or deletes greater than, say, 30,000 rows.

9. ALTER TABLE ADD COLUMN - It's better to rebuild the table by hand than to use this DDL statement. First off, any column constraints (such as NOT NULL) will be silently ignored, and secondly, inherited relations have problems with dump/restore.
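As a rough sketch of what I mean by rebuilding by hand (names are made up; adjust for your schema, recreate any indexes, and re-grant permissions afterwards - and, per point 5, I wouldn't wrap the DROP/CREATE in a single transaction):

-- example only: names are made up
CREATE TABLE customers_new (
    customer_id  int4 NOT NULL PRIMARY KEY,
    name         varchar(40) NOT NULL,
    fax          varchar(24) DEFAULT '' NOT NULL   -- the column being "added"
);

INSERT INTO customers_new (customer_id, name, fax)
    SELECT customer_id, name, '' FROM customers;

DROP TABLE customers;

ALTER TABLE customers_new RENAME TO customers;

Unlike ADD COLUMN, the constraints and default on the new column actually get enforced this way.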
10. IN, INTERSECT, EXCEPT - When writing your application, these SQL constructs seem nice, particularly since the data in your development database may be small initially. But all three of them force a nested sequential scan on the relation. For example:

emptoris=> explain SELECT employee FROM employees WHERE employee NOT IN (SELECT webuser FROM webusers);
NOTICE:  QUERY PLAN:

Seq Scan on employees  (cost=3.95 rows=59 width=12)
  SubPlan
    ->  Seq Scan on webusers  (cost=7.78 rows=145 width=12)

EXPLAIN

Since INTERSECT/EXCEPT rewrite the query to use IN, the problem exists with them as well. And since PostgreSQL does not yet have outer joins, you should instead write the query using a correlated subquery (EXISTS):

emptoris=> explain SELECT employee FROM employees WHERE NOT EXISTS (SELECT webuser FROM webusers WHERE webusers.webuser = employees.employee);
NOTICE:  QUERY PLAN:

Seq Scan on employees  (cost=3.95 rows=59 width=12)
  SubPlan
    ->  Index Scan using k_webusers1 on webusers  (cost=2.05 rows=1 width=12)

EXPLAIN

There are many more such things which, if avoided, allow PostgreSQL to work great. But with each release, a lot of these things become obsolete.

Mike Mascari