Re: performance for high-volume log insertion - Mailing list pgsql-performance
From:           Richard Huxton
Subject:        Re: performance for high-volume log insertion
Date:
Msg-id:         49ED8A37.4030509@archonet.com
In response to: Re: performance for high-volume log insertion (david@lang.hm)
Responses:      Re: performance for high-volume log insertion
List:           pgsql-performance
david@lang.hm wrote:
> On Tue, 21 Apr 2009, Stephen Frost wrote:
>
>> * david@lang.hm (david@lang.hm) wrote:
>>> while I fully understand the 'benchmark your situation' need, this
>>> isn't that simple.
>>
>> It really is. You know your application, you know its primary use
>> cases, and probably have some data to play with. You're certainly in
>> a much better situation to at least *try* and benchmark it than we are.
>
> rsyslog is a syslog server. it replaces (or for debian and fedora, has
> replaced) your standard syslog daemon. it receives log messages from
> every app on your system (and possibly others), filters, manipulates
> them, and then stores them somewhere. among the places that it can
> store the logs are database servers (native support for MySQL,
> PostgreSQL, and Oracle, plus libdbi for others)

Well, from a performance standpoint the obvious things to do are:
1. Keep a connection open, do NOT reconnect for each log-statement
2. Batch log statements together where possible
3. Use prepared statements
4. Partition the tables by day/week/month/year (configurable I suppose)

The first two are vital, the third takes you a step further. The fourth
is a long-term admin thing.

And possibly
5. Have two connections, one for fatal/error etc and one for info/debug
level log statements (configurable split?). Then you can use the
synchronous_commit setting on the less important ones. Might buy you
some performance on a busy system.

http://www.postgresql.org/docs/8.3/interactive/runtime-config-wal.html#RUNTIME-CONFIG-WAL-SETTINGS

> other apps then search and report on the data after it is stored. what
> apps? I don't know either. pick your favorite reporting tool and
> you'll be a step ahead of me (I don't know a really good reporting tool)
>
> as for sample data, you have syslog messages, just like I do. so you
> have the same access to data that I have.
>
> how would you want to query them? how would people far less
> experienced than you want to query them?
>
> I can speculate that some people would do two columns (time,
> everything else), others will do three (time, server, everything
> else), and others will go further (I know some who would like to
> extract IP addresses embedded in a message into their own column).
> some people will index on the time and host, others will want to do
> full-text searches of everything.

Well, assuming it looks much like traditional syslog, I would do
something like: (timestamp, host, facility, priority, message). It's
easy enough to stitch back together if people want that.

PostgreSQL's full-text indexing is quite well suited to logfiles I'd
have thought, since it knows about filenames, urls etc already.

If you want to get fancy, add a msg_type column and one subsidiary
table for each msg_type. So - you might have smtp_connect_from
(hostname, ip_addr). A set of perl regexps can match and extract the
fields for these extra tables, or you could do it with triggers inside
the database. I think it makes sense to do it in the application.
Easier for users to contribute new patterns/extractions. Meanwhile, the
core table is untouched so you don't *need* to know about these extra
tables.

If you have subsidiary tables, you'll want to partition those too and
perhaps stick them in their own schema (logs200901, logs200902 etc).

Rough SQL sketches of both the insert path and the table/partition
layout follow below, after the sig.

--
  Richard Huxton
  Archonet Ltd
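A rough sketch of what points 1-3 and 5 might look like at the SQL
level. The table and column names here are made up for illustration;
rsyslog would issue the equivalent through whichever database driver it
uses. Note that PREPARE is per-session, so it only pays off because of
point 1: the connection stays open.

  -- Point 3: prepare the INSERT once per connection, then re-use it.
  PREPARE log_insert (timestamptz, text, int, int, text) AS
      INSERT INTO logs (ts, host, facility, priority, message)
      VALUES ($1, $2, $3, $4, $5);

  -- Point 2: batch several messages into one transaction so you pay
  -- one commit (one WAL flush) for the lot, not one per message.
  BEGIN;
  EXECUTE log_insert (now(), 'web1', 16, 6, 'connect from 10.0.0.1');
  EXECUTE log_insert (now(), 'web1', 16, 6, 'client disconnected');
  COMMIT;

  -- Point 5: on the connection carrying only the unimportant messages,
  -- trade durability of the last few commits for speed (8.3+).
  SET synchronous_commit = off;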
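And a sketch of the schema side, assuming 8.3-style inheritance
partitioning. Again, every name here (logs, logs_200904, the logs200904
schema, smtp_connect_from) is invented for the example.

  -- Core table: one row per syslog message.
  CREATE TABLE logs (
      ts        timestamptz NOT NULL,
      host      text        NOT NULL,
      facility  int         NOT NULL,
      priority  int         NOT NULL,
      message   text        NOT NULL
  );

  -- One child table per month; queries against "logs" still see every
  -- partition, so readers don't need to know about them.
  CREATE TABLE logs_200904 (
      CHECK (ts >= '2009-04-01' AND ts < '2009-05-01')
  ) INHERITS (logs);

  CREATE INDEX logs_200904_ts_host_idx ON logs_200904 (ts, host);

  -- Full-text search over the message column.
  CREATE INDEX logs_200904_message_fts ON logs_200904
      USING gin (to_tsvector('english', message));

  -- Example subsidiary table for one msg_type, in its own schema.
  CREATE SCHEMA logs200904;
  CREATE TABLE logs200904.smtp_connect_from (
      ts        timestamptz NOT NULL,
      hostname  text,
      ip_addr   inet
  );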