INSERT performance deteriorates quickly during a large import - Mailing list pgsql-general
From | Krasimir Hristozov (InterMedia Ltd) |
---|---|
Subject | INSERT performance deteriorates quickly during a large import |
Date | |
Msg-id | 02b101c82221$4f59ced0$0400000a@imediadev.com |
Responses | Re: INSERT performance deteriorates quickly during a large import |
| Re: INSERT performance deteriorates quickly during a large import |
| Re: INSERT performance deteriorates quickly during a large import |
List | pgsql-general |
We need to import data from a relatively large MySQL database into an existing PostgreSQL database, using a PHP5 script that SELECTs from MySQL and INSERTs into PostgreSQL. Part of the import involves moving about 1,300,000 records from one MySQL table to one of our PostgreSQL tables. The problem is that the insert performance inevitably deteriorates as the number of inserts increases.

We tried different approaches:

* selecting only parts of the source table data, based on a certain condition
* selecting all of the source data, coupled with either of these:
  * inserting without explicit transactions
  * inserting all the data in a single transaction
  * inserting the data in partial transactions of about 100, 1000, 10000, or 100000 inserts each

While there were performance benefits in some of these cases (selecting all the data and inserting in transaction packets of about 1000 each was the fastest), the problem was that performance still deteriorated as the import progressed. We also tried removing all foreign keys and indices from the Postgres table; that gained some performance as well, but it too deteriorated over time.

The latest (and best-performing) test we did was under the following conditions:

* 11851 pre-existing records in the destination table
* the table was vacuumed just before the import
* all foreign keys and indices were removed from the destination table
* all of the data was selected from the source table at once
* inserts were done in transactions of 1000 inserts each

We displayed time statistics for each 100 inserts. The process started at about 1 second per 100 inserts, which extrapolated to about 4 hours for the entire import. 14 hours later it had imported about a quarter of the data (a bit more than 330,000 records), and 100 inserts now took nearly 40 seconds.

We tested reading from MySQL alone, without inserting the data into Postgres. All records were read in about a minute and a half, so MySQL performance is not part of the issue.

The PHP script selects the MySQL data, fetches rows sequentially, occasionally performs a couple of selects against PostgreSQL data (which is cached in a PHP array to reduce DB operations; no more than 80,000 array elements, integer keys, integer data), and inserts into PostgreSQL. The algorithm seems to be linear in nature, and the performance deterioration most probably doesn't come from the PHP code.
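In outline, the loop is structured like the minimal sketch below (connection parameters, column names, and the "old_id" lookup are illustrative placeholders, not the actual script):

    <?php
    // source: MySQL, destination: PostgreSQL (both on the same server)
    $my = mysql_connect('localhost', 'user', 'secret');
    mysql_select_db('forum_src', $my);
    $pg = pg_connect('dbname=forum_dst user=forum');

    $cache = array();   // integer keys => integer data, <= ~80000 elements
    $t0    = microtime(true);
    $i     = 0;

    // fetch the source rows sequentially, without buffering the whole set
    $res = mysql_unbuffered_query('SELECT * FROM thread_replies', $my);

    pg_query($pg, 'BEGIN');
    while ($row = mysql_fetch_assoc($res)) {
        // example of the occasional selects against PostgreSQL, memoized in
        // a PHP array (the "old_id" column is a placeholder for illustration)
        $src_uid = (int)$row['user_id'];
        if (!isset($cache[$src_uid])) {
            $r = pg_query_params($pg,
                'SELECT id FROM sp_users WHERE old_id = $1', array($src_uid));
            $cache[$src_uid] = (int)pg_fetch_result($r, 0, 0);
        }

        pg_query_params($pg,
            'INSERT INTO sp_thread_replies (thread_id, body, ts_added, user_id)
             VALUES ($1, $2, $3, $4)',
            array($row['thread_id'], $row['body'], $row['ts_added'],
                  $cache[$src_uid]));

        $i++;
        if ($i % 100 == 0) {    // time statistics on each 100 inserts
            printf("%d rows, %.2f s for the last 100\n",
                   $i, microtime(true) - $t0);
            $t0 = microtime(true);
        }
        if ($i % 1000 == 0) {   // transactions of 1000 inserts each
            pg_query($pg, 'COMMIT');
            pg_query($pg, 'BEGIN');
        }
    }
    pg_query($pg, 'COMMIT');
    ?>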
Has anyone had an issue like this, and perhaps a suggestion for a possible cause and solution? Is it common for PostgreSQL to slow down this much as the amount of data in the tables increases? If so, does it affect just the insert operation or all kinds of queries? Isn't 300,000 records too low a threshold for such performance deterioration?

Here are some technical details that might be helpful:

* PHP, MySQL and PostgreSQL all run on the same server, sharing the same memory and hard drive
* the server runs FreeBSD 5.3-RELEASE on an AMD Athlon(tm) 64 Processor 3000+ (2GHz K8-class CPU) with 1GB RAM
* the software versions installed are Apache 1.3.37, PHP 5.2.1, MySQL 4.1.22 and PostgreSQL 8.1.8
* postgresql.conf variables other than the defaults are: max_connections = 40, shared_buffers = 1000 (this is the default)
* we have also tried this on another server running Red Hat Enterprise Linux ES release 4 (Linux 2.6.9-42.0.3.ELsmp) on 2x Dual Core AMD Opteron(tm) Processor 270 (4x 2GHz logical CPUs) with 2GB RAM
* both servers run in x86_64 mode; the PostgreSQL memory footprint stays relatively small, CPU usage maxes out during the import, and there is no resource starvation of any kind

DDL statement for the creation of the PostgreSQL table in question:

    CREATE TABLE "public"."sp_thread_replies" (
      "id" SERIAL,
      "thread_id" INTEGER NOT NULL,
      "body" TEXT NOT NULL,
      "ts_added" INTEGER DEFAULT 0 NOT NULL,
      "user_id" INTEGER NOT NULL,
      "thread_offset" INTEGER DEFAULT 0,
      "approved" SMALLINT DEFAULT 1,
      "title" TEXT,
      "deleted" SMALLINT DEFAULT 0,
      "edit_reason" VARCHAR(255),
      "edit_user_id" INTEGER,
      "edit_time" INTEGER,
      CONSTRAINT "sp_thread_replies_pkey" PRIMARY KEY ("id"),
      CONSTRAINT "sp_thread_replies_threads_fk" FOREIGN KEY ("thread_id")
        REFERENCES "public"."sp_threads"("id")
        ON DELETE CASCADE ON UPDATE NO ACTION NOT DEFERRABLE,
      CONSTRAINT "sp_thread_replies_users_fk" FOREIGN KEY ("user_id")
        REFERENCES "public"."sp_users"("id")
        ON DELETE NO ACTION ON UPDATE NO ACTION NOT DEFERRABLE
    ) WITH OIDS;

The table is part of a custom forum engine and stores all thread posts. It is most often queried with SELECTs and INSERTs, less often with UPDATEs, and records are deleted quite seldom in the normal operation of the application (though we may delete records manually from the console from time to time).
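In case it matters, here is roughly how the foreign keys and the primary key can be dropped before the load and restored afterwards, as in the no-keys test (a sketch reusing the constraint names from the DDL above; the connection line is a placeholder):

    <?php
    $pg = pg_connect('dbname=forum_dst user=forum');

    // before the import: drop the foreign keys and the primary key
    pg_query($pg, 'ALTER TABLE sp_thread_replies
                   DROP CONSTRAINT sp_thread_replies_threads_fk');
    pg_query($pg, 'ALTER TABLE sp_thread_replies
                   DROP CONSTRAINT sp_thread_replies_users_fk');
    pg_query($pg, 'ALTER TABLE sp_thread_replies
                   DROP CONSTRAINT sp_thread_replies_pkey');

    // ... run the import loop here ...

    // after the import: restore each constraint in a single pass
    pg_query($pg, 'ALTER TABLE sp_thread_replies
                   ADD CONSTRAINT sp_thread_replies_pkey PRIMARY KEY (id)');
    pg_query($pg, 'ALTER TABLE sp_thread_replies
                   ADD CONSTRAINT sp_thread_replies_threads_fk
                   FOREIGN KEY (thread_id)
                   REFERENCES sp_threads (id) ON DELETE CASCADE');
    pg_query($pg, 'ALTER TABLE sp_thread_replies
                   ADD CONSTRAINT sp_thread_replies_users_fk
                   FOREIGN KEY (user_id)
                   REFERENCES sp_users (id)');
    ?>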