Thread: INSERT extremely slow with large data sets
Hi Everyone,

This is my first post here, so please point me elsewhere if this is the wrong place for questions like this.

I am using PostgreSQL 7.3.2 and have used earlier versions (7.1.x onwards), and with all of them I noticed the same problem with INSERTs once the data set gets large. So that you can compare, here is the time it takes to insert one row into the table when there are only a few rows present and when there are thousands:

Rows Present     Start Time        Finish Time
------------------------------------------------------------
     100         1068790804.12     1068790804.12
    1000         1068790807.87     1068790807.87
    5000         1068790839.26     1068790839.27
   10000         1068790909.24     1068790909.26
   20000         1068791172.82     1068791172.85
   30000         1068791664.06     1068791664.09
   40000         1068792369.94     1068792370.0
   50000         1068793317.53     1068793317.6
   60000         1068794369.38     1068794369.47

As you can see, it takes an awfully long time just to have those values inserted. To give you a clearer picture, this table holds a lot of information, about 25 columns, and there are a few indexes I created so that selecting values from it is fast, which works fine: any SELECT takes under 5 seconds.

Any help would be greatly appreciated, even just pointing me in the right direction to ask this question. By the way, I designed the database this way because my application uses PostgreSQL heavily during execution, so there was a great need for fast SELECTs. Our experiments are getting larger every day, so fast inserts would be good as well.

Just to note, the times above are for the INSERTs only; nothing else is included in them. The machine was otherwise idle (this was the only process running the whole time) and it was an Intel(R) Pentium(R) 4 CPU 2.40GHz.

Regards,
Slavisa
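For illustration, a per-row timing test of this kind can be sketched in psql with \timing; the stand-in table below is hypothetical (the real table has about 25 columns and several indexes, as described above):

    -- hypothetical stand-in for the real 25-column table
    CREATE TABLE insert_test (id integer, payload text);
    CREATE UNIQUE INDEX insert_test_idx ON insert_test (id);

    \timing
    -- time one INSERT with only a few rows present...
    INSERT INTO insert_test VALUES (101, 'row 101');
    -- ...then again after tens of thousands of rows have been loaded
    INSERT INTO insert_test VALUES (60001, 'row 60001');

psql reports the elapsed time of each statement, which is roughly what the start/finish timestamps in the table above are capturing.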
On Fri, Nov 14, 2003 at 06:36:41PM +1100, Slavisa Garic wrote:

> Rows Present     Start Time        Finish Time
> ------------------------------------------------------------
>      100         1068790804.12     1068790804.12
>     1000         1068790807.87     1068790807.87
>     5000         1068790839.26     1068790839.27
>    10000         1068790909.24     1068790909.26
>    20000         1068791172.82     1068791172.85
>    30000         1068791664.06     1068791664.09
>    40000         1068792369.94     1068792370.0
>    50000         1068793317.53     1068793317.6
>    60000         1068794369.38     1068794369.47
> [too slow]

Ok, so inserting 60000 rows seems to take 0.09 seconds, and inserting
5000 takes only 0.01.  And your problem is exactly what?

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"Pick the flower that blooms today, glad and proud. Who knows whether
another will bloom tomorrow?"
> -----Original Message-----
> From: Slavisa Garic [mailto:Slavisa.Garic@infotech.monash.edu.au]
> Sent: Thursday, November 13, 2003 11:37 PM
> To: pgsql-hackers@postgresql.org
> Subject: [HACKERS] INSERT extremely slow with large data sets
>
> I am using PostgreSQL 7.3.2 and have used earlier versions
> (7.1.x onwards), and with all of them I noticed the same problem
> with INSERTs once the data set gets large.
>
> [snip timing table and details; see the original post above]

You should post the schema for the table in question when you ask a
question like this.

The behavior is not surprising in the least bit.  Every database will
perform this way, since you have mentioned that you have indexes on
the table.

The depth of the tree will be proportional to the log of the row count.
As the tree gets deeper, inserts will be more and more expensive.

If you have a giant pile of stuff to insert, consider the COPY command
or API if it is time critical.
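As a rough sketch of the COPY suggestion (the table and file names here are hypothetical, since no schema had been posted at this point in the thread), a bulk load of pre-generated rows replaces one INSERT per row with a single statement:

    -- load many rows in one statement; by default COPY expects
    -- tab-delimited text with one line per row
    -- (hypothetical table and file names)
    COPY my_big_table FROM '/tmp/my_big_table.dat';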
On Fri, 14 Nov 2003, Alvaro Herrera wrote:

> [snip quoted timing table]
>
> Ok, so inserting 60000 rows seems to take 0.09 seconds, and inserting
> 5000 takes only 0.01.  And your problem is exactly what?

You didn't understand the question. Inserting ONE ROW when there are
already 5000 ROWS present takes 0.01 seconds. Inserting ONE ROW when
there are already 60000 ROWS present takes 0.09 seconds. In other words,
it takes about 9 times longer to insert ONE ROW when there is a larger
set of data already in the database. As my experiments grow and more
data is inserted, this is starting to take too long: inserting 70000
rows takes just over an hour, while inserting 5000 takes about a minute
and a half.

I don't know if this is the behaviour to be expected, which is why I
posted the question; that is my problem. I also wanted to know what can
be done to improve this, if anything.

Regards,
Slavisa
Hi Dann,

Here is the schema. Could you also be more specific about the COPY
command? And does talking directly to the API speed things up? (I am new
to databases but I am learning quickly.)

-- NimrodEnfJob --

create table NimrodEnfJob(
    exp_id        INTEGER not null references NimrodEnfExperiment,
    task_id       INTEGER not null references NimrodTask,
    pgroup_id     INTEGER not null references NimrodParameterGroup,
    agent_id      INTEGER references NimrodAgent on delete set null,
    jobname       varchar(255) not null,
    admin         char(1) not null default 'F'
                  check (admin in ('F', 'T')),
    taskname      varchar(255) not null,
    param_text    TEXT not null,
    open          char(1) not null default 'F'
                  check (open in ('F', 'T')),
    control       varchar(8) not null default 'start'
                  check (control in ('nostart', 'start', 'stop')),
    status        varchar(16) not null default 'ready'
                  check (status in ('ready', 'executing', 'failed', 'done')),
    cpulength     real not null,
    sleeptime     real not null,
    filesize      real not null,
    cputime       real,
    waittime      real,
    filetime      real,
    filebytes     integer,
    priority      integer not null default 100,
    create_time   timestamp not null default CURRENT_TIMESTAMP,
    start_time    timestamp,
    finish_time   timestamp,
    budget        real not null default 0.0,
    servername    varchar(255),
    error_info    varchar(255) not null default '',
    more_info     TEXT not null default '',
    primary key (exp_id, jobname),
    foreign key (exp_id, taskname) references NimrodEnfTask
);

Also, these are the indexes on this table. I created them on the columns
that are most commonly accessed:

create unique index nej_idx
    ON NimrodEnfJob (exp_id, pgroup_id);

create unique index nej_idx1
    ON NimrodEnfJob (pgroup_id);

create index nej_idx2
    ON NimrodEnfJob (status);

create unique index nej_idx3
    ON NimrodEnfJob (status, pgroup_id);

create index nej_idx4
    ON NimrodEnfJob (status, agent_id);

create index nej_idx5
    ON NimrodEnfJob (agent_id);

I did notice that removing those indexes doesn't improve things by much.
Similar behaviour is observed; it just takes a bit less time to insert
(about 0.01 seconds less than usual at 60000 records).

Regards,
Slavisa

On Fri, 14 Nov 2003, Dann Corbit wrote:

> [snip quoted original post]
>
> You should post the schema for the table in question when you ask a
> question like this.
>
> The behavior is not surprising in the least bit.  Every database will
> perform this way, since you have mentioned that you have indexes on
> the table.
>
> The depth of the tree will be proportional to the log of the row count.
> As the tree gets deeper, inserts will be more and more expensive.
>
> If you have a giant pile of stuff to insert, consider the COPY command
> or API if it is time critical.
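A related pattern worth noting (a sketch only, not something benchmarked in this thread) is to drop the secondary indexes before a large load and rebuild them once afterwards, using the definitions above:

    -- drop the non-unique secondary indexes before the bulk load...
    DROP INDEX nej_idx2;
    DROP INDEX nej_idx4;
    DROP INDEX nej_idx5;

    -- ...bulk INSERT or COPY the new rows here...

    -- ...then rebuild them once, instead of maintaining them row by row
    CREATE INDEX nej_idx2 ON NimrodEnfJob (status);
    CREATE INDEX nej_idx4 ON NimrodEnfJob (status, agent_id);
    CREATE INDEX nej_idx5 ON NimrodEnfJob (agent_id);

The unique indexes and the primary key are left in place in this sketch, since they also enforce constraints on the data.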
> -----Original Message-----
> From: Slavisa Garic [mailto:Slavisa.Garic@infotech.monash.edu.au]
> Sent: Friday, November 14, 2003 5:12 PM
> To: Dann Corbit
> Cc: Slavisa Garic; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] INSERT extremely slow with large data sets
>
> Hi Dann,
>
> Here is the schema. Could you also be more specific about the COPY
> command?

http://www.postgresql.org/docs/7.3/static/sql-copy.html

and

http://techdocs.postgresql.org/techdocs/usingcopy.php

may be helpful.

> And does talking directly to the API speed things up? (I am new
> to databases but I am learning quickly.)

Not particularly.  This is the COPY command API:

http://www.postgresql.org/docs/7.3/static/libpq-copy.html

What the API allows you to do (for instance) is never let the data touch
the ground: instead of writing to a text file or even a binary-format
COPY input file, you use the API to take the incoming data and insert it
directly.

Like everything else, there is a dark side.  Read the documents and they
will explain it.  But if you need to move a giant pile of data into the
database as fast as possible, the COPY command is the most efficient way.

> [snip quoted schema and index definitions; see previous message]
>
> I did notice that removing those indexes doesn't improve things by much.
> Similar behaviour is observed; it just takes a bit less time to insert
> (about 0.01 seconds less than usual at 60000 records).

I am quite surprised that removing the indexes does not have a large
impact on insert speed, especially since you have 6 of them.

Most of the other costs that I can think of are fixed for inserts into a
"bare table".  Perhaps someone with more intimate knowledge of the inner
workings may know why inserts into a table without any index would trail
off in speed as the table grows.

[snip]
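The same streaming idea is also reachable from psql without touching libpq directly: \copy reads a file on the client side and sends it over the connection as COPY ... FROM STDIN. A sketch, assuming a hypothetical tab-delimited file that supplies a value for every column of the table in order:

    -- client-side bulk load: the file lives on the machine running psql,
    -- not on the database server (hypothetical file name)
    \copy NimrodEnfJob from '/tmp/nimrod_jobs.dat'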
Slavisa Garic <Slavisa.Garic@infotech.monash.edu.au> writes:
> You didn't understand the question. Inserting ONE ROW when there are
> already 5000 ROWS present takes 0.01 seconds. Inserting ONE ROW when
> there are already 60000 ROWS present takes 0.09 seconds.

The numbers you presented didn't really offer any strong grounds for
believing that there's an O(N) growth rate --- as far as I can see your
results are only barely out of the measurement-noise category.  Can you
run some tests where the issue is not in the least significant digit of
the available numbers?

But assuming for the moment that you've got hold of a real problem...

The actual insertion of a row should be essentially a constant-time
operation, since we just stick it into the last page of the table (or
any page with sufficient free space).  Insertion of index entries for
the row would have cost that depends on the number of existing table
entries, but for btree indexes I'd expect the cost to vary as
O(log2(N)), not O(N).  I do not think you've presented enough evidence
to prove that you're seeing linear rather than log-N cost growth.

Most of the serious insertion-cost problems we've seen lately have to do
with the costs of checking foreign key references ... but those normally
vary with the size of the referenced table, not the table into which
you're inserting.  Besides which, you mentioned nothing about foreign
keys ... or any other schema details, as far as I saw ...

			regards, tom lane
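One way to get measurements out of the least-significant digit, as requested above, is to time whole batches instead of single rows. A rough sketch (the staging tables are hypothetical, each pre-loaded with 10000 distinct rows of the same shape as NimrodEnfJob so the unique indexes are not violated):

    \timing
    -- time a batch of 10000 rows at each table size instead of one row
    INSERT INTO NimrodEnfJob SELECT * FROM staging_batch_1;  -- rows 1-10000
    INSERT INTO NimrodEnfJob SELECT * FROM staging_batch_2;  -- rows 10001-20000
    INSERT INTO NimrodEnfJob SELECT * FROM staging_batch_3;  -- rows 20001-30000
    -- ...and compare the elapsed time reported for each batch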