Re: Using a COPY...FROM through JDBC? - Mailing list pgsql-jdbc
| | |
|---|---|
| From | Markus Schaber |
| Subject | Re: Using a COPY...FROM through JDBC? |
| Date | |
| Msg-id | 20040607000800.636d7353@kingfisher.intern.logi-track.com |
| In response to | Using a COPY...FROM through JDBC? (Steve Wampler <swampler@noao.edu>) |
| Responses | Re: Using a COPY...FROM through JDBC? |
| List | pgsql-jdbc |
Hi, Steve,

On Sat, 05 Jun 2004 13:12:29 -0700
Steve Wampler <swampler@noao.edu> wrote:

> I've got a simple database (no indices, 6 columns) that I need
> to write data quickly into through JDBC connections from
> multiple such connections simultaneously in a distributed
> environment. (This is going to be a message logging service
> for software generated messages.)
>
> Using a PreparedStatement, I can get about 400/s inserted. If I
> (on the java side) buffer up the entries and dump them in large
> transaction blocks I can push this up to about 1200/s. I'd
> like to go faster. One approach that I think might be
> promising would be to try using a COPY command instead of
> an INSERT, but I don't have a file for input, I have a
> Java collection, so COPY isn't quite right. Is there anyway to
> efficiently use COPY without having to create a file (remember
> that the java apps are distributed on a LAN and aren't running
> on the DB server.) Is this a dead end because of the way
> COPY is implemented to only use a file?

We also found that, using the psql frontend, COPY seems to give a factor
10 or more speedup. Sadly, as far as I know, the current JDBC driver does
not support COPY ... FROM STDIN.

As a very bad workaround, it might be acceptable to use Runtime.exec() to
start the psql command line tool and issue the statement there, or even to
call a C library via JNI.

Of course, the best "workaround" would be to implement COPY support in the
driver and send the patches to the PGJDBC team for inclusion. :-)
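For illustration, a very rough sketch of that Runtime.exec() workaround
could look like the following. The host, user, database, table and column
names are made-up examples, and authentication is assumed to be handled
outside the program (e.g. via .pgpass):

```java
import java.io.BufferedWriter;
import java.io.OutputStreamWriter;

public class PsqlCopyWorkaround {
    public static void main(String[] args) throws Exception {
        // Start the psql command line tool; it reads the COPY data rows
        // from its standard input. All connection parameters, the table
        // and the column names are only examples.
        Process psql = Runtime.getRuntime().exec(new String[] {
            "psql", "-h", "dbserver", "-U", "loguser",
            "-c", "COPY messages (host, facility, level, code, origin, msg) FROM STDIN",
            "logdb"
        });

        BufferedWriter out = new BufferedWriter(
                new OutputStreamWriter(psql.getOutputStream()));

        // One line per row, columns separated by tabs (COPY's default text format).
        out.write("apphost1\tlocal0\tinfo\t42\tlogger\thello world\n");
        out.write("apphost2\tlocal0\twarn\t43\tlogger\tsecond row\n");

        // End-of-data marker, then close psql's stdin so the COPY finishes.
        out.write("\\.\n");
        out.close();

        int rc = psql.waitFor();
        System.out.println("psql exited with " + rc);
    }
}
```

In real use you would of course stream the rows of your Java collection
instead of hard-coded strings, and check psql's exit code and stderr for
errors.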
We also had to do some trickery to get insertion of lots of rows fast. We
did lots of benchmarks and currently use the following method: our input
data is divided into chunks (the optimal size depends on the machine and
seems to be between 250 and 3000 rows). As the current pgjdbc
PreparedStatement implementation just does a text replacement, but we
wanted to get the last bit of speed out of the machine, we issue a
"PREPARE" statement for the insertion on connection setup, and then
addBatch() an "EXECUTE blubb (data, row, values)" statement for each row.

Then we have several concurrent threads, all running essentially a
{get batch, write batch, commit} loop on their own connection. Increasing
the thread number to more than three did not show further substantial
performance improvements.
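As a rough sketch (not our actual code), one such writer thread could look
like this; the "blubb" prepared statement, the six-column "messages" table,
the queue, the stop signal and the naive quoting are illustrative
assumptions:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.List;
import java.util.concurrent.BlockingQueue;

public class BatchWriter implements Runnable {
    private final BlockingQueue<List<String[]>> chunks; // each chunk: ~250-3000 rows
    private final String url;   // e.g. "jdbc:postgresql://dbserver/logdb"

    public BatchWriter(BlockingQueue<List<String[]>> chunks, String url) {
        this.chunks = chunks;
        this.url = url;
    }

    public void run() {
        try (Connection con = DriverManager.getConnection(url, "loguser", "secret")) {
            con.setAutoCommit(false);
            try (Statement st = con.createStatement()) {
                // Prepare the insert once per connection; "blubb" and the
                // example table layout are placeholders.
                st.execute("PREPARE blubb (text, text, text, text, text, text) AS "
                         + "INSERT INTO messages VALUES ($1, $2, $3, $4, $5, $6)");

                while (true) {
                    List<String[]> chunk = chunks.take();  // get batch (blocks)
                    if (chunk.isEmpty()) {
                        break;                             // empty chunk = stop signal
                    }
                    for (String[] row : chunk) {           // write batch
                        st.addBatch("EXECUTE blubb (" + quoted(row) + ")");
                    }
                    st.executeBatch();
                    con.commit();                          // commit
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Very naive literal quoting, for illustration only.
    private static String quoted(String[] row) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < row.length; i++) {
            if (i > 0) sb.append(", ");
            sb.append('\'').append(row[i].replace("'", "''")).append('\'');
        }
        return sb.toString();
    }
}
```

Two or three of these threads, each on its own connection and fed from one
shared queue, give the {get batch, write batch, commit} loop described
above.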
This led us to the conclusion that concurrency can compensate for the time
the postmaster is forced to wait while it syncs the WAL to disk, but that
there is still a concurrency limit inside PostgreSQL for inserts (I presume
they have to lock at some point; the multiversioning seems not to cover
inserts very well).

Also, we surprisingly found that setting the transaction isolation to
"serializable" can speed things up remarkably in some cases.

> Is there something else I can do? Ultimately, this will end
> up on a machine running 1+0 RAID, so I expect that will give
> me some performance boost as well, but I'd like to push it
> up as best I can with my current hardware setup.

As any sane setup runs with syncing enabled in the backend, and each sync
(and so each commit) has to write at least one block, you can calculate the
theoretical maximum number of commits your machine can achieve. If you have
15k rpm disks (AFAIK the fastest currently available), they spin at 250
rotations per second (15000 / 60), so you cannot have more than 250 commits
per second. Given that your machine has to do some work between the sync()
calls (e.g. processing the whole next batch), it is very likely to miss the
next revolution, so in reality you are likely to end up a factor of 2 or 3
below that.

One way to overcome this limit is to use multiple writer threads and (given
a highly capable I/O subsystem) to enable commit delay in your backend, so
that more than one commit can share the same write operation.

It might also help to put the WAL on a different disk (just symlink, mount,
or mount --bind the appropriate subdirectory of your database directory),
or even to put the indices on a third disk (needs ugly trickery). It's a
shame that the postmaster does not really support these techniques, which
are considered standard in any professional database.

If you really need much more speed, you could try to put the WAL on a solid
state disk (essentially battery-backed RAM) to overcome this physical
limit, or (if you really trust your hardware and your power supply) put the
WAL on a RAM disk or switch off syncing in your postmaster configuration.

One thing you should check is whether I/O or CPU is the limiting factor. If
your CPU utilization is higher than 90%, then all the tricks I described
won't help much. (But using COPY could still help a lot, as it cuts down
CPU usage considerably.)

We tested with two machines, a single-processor developer machine and a
2-way 64-bit Itanium SMP machine. On the desktop machine, a single thread
already used 80% CPU, so only a small improvement was possible using 2 or
more threads. On the SMP machine, we had substantial improvements using 2
or 3 threads, but going up to 8 threads gave no further remarkable speedup,
constantly utilizing about 120% CPU (remember it is a 2-way machine). I
think there are some internal PostgreSQL locks that prohibit further
concurrency for inserts into the same table.

> Thanks for any advice!

Hope that helps,

Markus Schaber

--
markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich
phone +41-43-888 62 52 | fax +41-43-888 62 53
mailto:schabios@logi-track.com | www.logi-track.com

pgsql-jdbc by date: