Re: Performance while loading data and indexing - Mailing list pgsql-general
From:           Martijn van Oosterhout
Subject:        Re: Performance while loading data and indexing
Msg-id:         20020926090519.GB10471@svana.org
In response to: Performance while loading data and indexing ("Shridhar Daithankar" <shridhar_daithankar@persistent.co.in>)
Responses:      Re: Performance while loading data and indexing
List:           pgsql-general
I'll preface this by saying that while I have a large database, it doesn't
require quite the performance you're talking about here.

On Thu, Sep 26, 2002 at 02:05:44PM +0530, Shridhar Daithankar wrote:
> 1) Database load time from flat file using copy is very high
> 2) Creating index takes huge amount of time.
> 3) Any suggestions for runtime as data load and query will be going in
> parallel.

You're loading all the data in one copy. I find that INSERTs are mostly
limited by indexes. While index lookups are cheap, they are not free and
each index needs to be updated for each row. I found that using partial
indexes to index only the rows you actually use can help with the loading.
It's a bit obscure though.

As for parallel loading, you'll be limited mostly by your I/O bandwidth.
Have you measured it to make sure it's up to speed?

> Now the details. Note that this is a test run only..
>
> Platform:- 4x Xeon2.4GHz/4GB RAM/4x48 SCSI RAID5/72 GB SCSI
> RedHat7.2/PostgreSQL7.1.3
>
> Database in flat file:
> 125,000,000 records of around 100 bytes each.
> Flat file size 12GB
>
> Load time: 14581 sec/~8600 rows per sec/~ an MB of data per sec.
> Create unique composite index on 2 char and a timestamp field: 25226 sec.
> Database size on disk: 26GB
> Select query: 1.5 sec. for approx. 150 rows.

So you're loading at a rate of 860KB per sec. That's not too fast. How many
indexes are active at that time? Triggers and foreign keys also take their
toll.

> Important postgresql.conf settings
>
> sort_mem = 12000
> shared_buffers = 24000
> fsync=true (Sad but true. Left untouched.. Will that make a difference on
> SCSI?)
> wal_buffers = 65536
> wal_files = 64

fsync IIRC only affects the WAL buffers now, but it may be quite expensive,
especially since it runs on every transaction commit. Oh, and are your WAL
files on a separate disk from the data?

> Initial flat data load: 250GB of data. This has gone up since last query. It
> was 150GB earlier..
> Ongoing inserts: 5000/sec.
> Number of queries: 4800 queries/hour
> Query response time: 10 sec.

That looks quite achievable.

> 1) Instead of copying from a single 12GB data file, will a parallel copy from
> say 5 files speed up things?

Limited by I/O bandwidth. On Linux, vmstat can tell you how many blocks are
being loaded and stored per second. Try it. As long as sync() doesn't get
done too often, it should help.
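Just as a rough sketch of what I mean by a parallel load (the database,
table and column names below are made up, and the file format has to match
whatever your flat file actually contains):

    # Split the 12GB flat file into 5 roughly equal pieces by line count.
    split -l 25000000 flatfile.dat chunk_

    # One COPY per connection; psql feeds its stdin to COPY ... FROM STDIN.
    # "yourdb" and "rawdata" are placeholders for your database and table.
    for f in chunk_*; do
        psql yourdb -c "COPY rawdata FROM STDIN" < "$f" &
    done
    wait

    # Build the unique composite index only after the bulk load is done.
    psql yourdb -c "CREATE UNIQUE INDEX rawdata_key ON rawdata (code1, code2, stamp)"

Whether five streams actually buy you anything depends on whether a single
backend or the array is the limit.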
> Couple MB of data per sec. to disk is just not saturating it. It's a RAID 5
> setup..

No, it's not. You should be able to do better.

> 2) Sort mem.=12K i.e. 94MB, sounds good enough to me. Does this need further
> addition to improve create index performance?

Should be fine. Admittedly your indexes are taking rather long to build.

> 3) 5K concurrent inserts with an index on, will this need additional CPU
> power? Like deploying it on dual RISC CPUs etc?

It shouldn't. Do you have an idea of what your CPU usage is? ps aux should
give you a decent idea.

> 4) Query performance is not a problem. Though 4.8K queries per sec. expected
> response time from each query is 10 sec. But my guess is some serious CPU
> power will be chewed there too..

Should be fine.

> 5) Will upgrading to 7.2.2/7.3 beta help?

Possibly, though it may be worth it just for the features/bugfixes.

> All in all, in the test, we didn't see the performance where hardware is
> saturated to its limits. So effectively we are not able to get postgresql
> making use of it. Just pushing WAL and shared buffers does not seem to be
> the solution.
>
> If you guys have any suggestions, let me know. I need them all..

Find the bottleneck: CPU, I/O or memory?

> Mysql is almost out because it's creating index for last 17 hours. I don't
> think it will keep up with 5K inserts per sec. with index. SAP DB is under
> evaluation too. But postgresql is most favourite as of now because it works.
> So I need to come up with solutions to problems that will occur in near
> future.. ;-)

17 hours! Ouch. Either way, you should be able to do much better.
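To answer the "find the bottleneck" question for yourself, the quickest
thing is to watch the machine from another terminal while the load is
running; roughly (the grep pattern is just a guess at your process names):

    # Blocks in/out per second (bi/bo) and swap activity, sampled every 5 seconds.
    vmstat 5

    # CPU and memory use of the individual PostgreSQL backends.
    ps aux | grep postgres | grep -v grep

If bi/bo sit well below what the RAID array can do while the backends are
pegged at 100% CPU, more spindles won't help; if it's the other way around,
look at the I/O setup first.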
Hope this helps,
--
Martijn van Oosterhout <kleptog@svana.org>   http://svana.org/kleptog/
> There are 10 kinds of people in the world, those that can do binary
> arithmetic and those that can't.