Thread: Newbie dbadmin out of his league
Hi all,

I am new to dbadmin, and have to load 2 million records into a db every
month :( So far we have found that a naive approach will result in the
data taking longer to load than it is valid for.

Initially we were using JDBC to insert or update the table. Now we are
only using JDBC to insert new category records, such as when a new
format or country is encountered. The actual INSERT or UPDATE statements
are written out to an SQL file. Creating the SQL file is estimated at
30 hours, but updating the database with:

  psql -f output.sql db

looks as though it will take forever: it has inserted 177778 records in
43 hours. I have just dropped a text index to help things along.

So two questions:

1. The primary key is an autoincrement field called id:

CREATE SEQUENCE parts_id_seq start 1 increment 1 maxvalue 2147483647
  minvalue 1 cache 1;
SELECT nextval('parts_id_seq');
CREATE TABLE "parts" (
  "id" int4 DEFAULT nextval('parts_id_seq') NOT NULL,
  "part_no" character(50) NOT NULL,
  "deleted" bool,
  "updatedate" timestamp
);

The INSERT statements do not give the id explicitly. Would I be better
off dropping the auto-increment and then re-applying it? How do I do
this?

2. How do I use COPY?

Presumably I again have to take control of the key values:
  drop the index
  copy from tab-delimited file containing hard ids
  create id sequence
  create index
  modify id definition

I would be very grateful for any comments, and any longer-term help can
be paid for.

yours
Tim Pizey
http://www.paneris.co.uk/
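(Presumably each generated statement looks roughly like the sketch below; the
values are made up. Since the id column is omitted, every such INSERT fires
the column DEFAULT, and therefore a nextval() call, per row:)

INSERT INTO parts (part_no, deleted, updatedate)
VALUES ('ABC-12345', 'f', '1999-08-27 12:00:00');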
Tim Pizey <tim@paneris.co.uk> writes:
> 2. How do I use COPY?
> Presumably I again have to take control of the key values,
>   drop the index
>   copy from tab delimited file containing hard ids
>   create id sequence
>   create index
>   modify id definition

Yup, that's about what you need to do. You can leave the "DEFAULT"
clause where it is, since it won't be invoked during a COPY that's
supplying non-default values for the ID column. (A good thing too,
since I don't think we support ALTER TABLE ADD DEFAULT...) Just create
the sequence with the right starting value (one past last ID being
loaded). You can do that before or after the COPY, doesn't matter.

A COPY will be way faster than a series of INSERT commands, especially
if each INSERT is invoking a nextval(). nextval() is great for
serializing live updates but it's pretty inefficient for a bulk-loading
situation. Creating the index after the load is reputed to be faster
than building it incrementally, as well.

BTW, I believe pg_dump gets this right, so you could look at the script
generated by pg_dump of a small sample table for details.

			regards, tom lane
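(Spelled out against the parts table from the original post, this recipe
might look roughly like the following. The index name, file path, and the
start value 2000001 are all placeholders; the start value should be one past
the last id actually loaded:)

-- hypothetical index name; use whatever the existing text index is called
DROP INDEX parts_part_no_idx;

-- tab-delimited file that includes the hard id values (path is made up)
COPY "parts" FROM '/tmp/parts.tab';

-- recreate the sequence starting one past the last loaded id
DROP SEQUENCE parts_id_seq;
CREATE SEQUENCE parts_id_seq start 2000001;

-- rebuild the index only after the data is in
CREATE INDEX parts_part_no_idx ON parts (part_no);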
On Fri, Aug 27, 1999 at 05:19:18PM -0400, Tom Lane wrote:
> Tim Pizey <tim@paneris.co.uk> writes:
> > 2. How do I use COPY?
>
> Just create the sequence with the right starting value (one past last
> ID being loaded). You can do that before or after the COPY, doesn't
> matter.

What I always do with sequences that are autocreated for serial types
(and so have a start value of 1), after the COPY puts all my data in
place:

select setval('tablename_field_seq', max(field)) from tablename;

Ross

--
Ross J. Reedstrom, Ph.D., <reedstrm@rice.edu>
NSBRI Research Scientist/Programmer
Computer and Information Technology Institute
Rice University, 6100 S. Main St., Houston, TX 77005
At 00:19 +0300 on 28/08/1999, Tom Lane wrote:

> > Presumably I again have to take control of the key values,
> >   drop the index
> >   copy from tab delimited file containing hard ids
> >   create id sequence
> >   create index
> >   modify id definition
>
> Yup, that's about what you need to do. You can leave the "DEFAULT"
> clause where it is, since it won't be invoked during a COPY that's
> supplying non-default values for the ID column. (A good thing too,
> since I don't think we support ALTER TABLE ADD DEFAULT...)

Hmmm. If it were I, I would have tackled it in a slightly different way:

COPY the data into a temporary table that doesn't have the id numbers at
all. Thus you don't have to have a counter on the client side that knows
the last id that's already in the table, etc., and you also don't have
to transfer several extra bytes per row through the postgres port.

Then, when you have a temp table, you can add the values to the main
table with an

INSERT INTO main_table (field1, field2, field3)
SELECT field1, field2, field3
FROM temp_table;

If you don't mention the field that carries the default in this INSERT
statement, it will invoke the default. Dropping the index may still be a
good idea. The temp table shouldn't have an index anyway.

This would save you at least the three last steps in your "recipe".

Herouth

--
Herouth Maoz, Internet developer.
Open University of Israel - Telem project
http://telem.openu.ac.il/~herutma
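(A rough sketch of the above for the parts table from the original post; the
scratch-table name and file path are made up, and a plain table stands in
for the temp table:)

-- scratch table with no id column and no index
CREATE TABLE parts_load (
  "part_no" character(50),
  "deleted" bool,
  "updatedate" timestamp
);

-- tab-delimited file, no id column
COPY parts_load FROM '/tmp/parts_noid.tab';

-- id is not mentioned, so DEFAULT nextval('parts_id_seq') fires per row
INSERT INTO parts (part_no, deleted, updatedate)
SELECT part_no, deleted, updatedate
FROM parts_load;

DROP TABLE parts_load;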
At 17:11 27/08/99 -0500, Ross J. Reedstrom wrote:
>On Fri, Aug 27, 1999 at 05:19:18PM -0400, Tom Lane wrote:
>> Tim Pizey <tim@paneris.co.uk> writes:
>> > 2. How do I use COPY?
>>
>> Just create the sequence with the right starting value (one past last
>> ID being loaded). You can do that before or after the COPY, doesn't
>> matter.
>
>What I always do with sequences that are autocreated for serial types
>(and so have a start value of 1), after the COPY puts all my data in place:
>
>select setval('tablename_field_seq',max(field)) from tablename;

Thanks Ross, I tried this but the max() function takes as long as
anything else as far as I can see.

yours
Tim Pizey

Happy to take part: http://www.paneris.co.uk/
At 18:45 01/09/99 +0300, Herouth Maoz wrote:
>At 00:19 +0300 on 28/08/1999, Tom Lane wrote:
>
>> > Presumably I again have to take control of the key values,
>> >   drop the index
>> >   copy from tab delimited file containing hard ids
>> >   create id sequence
>> >   create index
>> >   modify id definition
>>
>> Yup, that's about what you need to do. You can leave the "DEFAULT"
>> clause where it is, since it won't be invoked during a COPY that's
>> supplying non-default values for the ID column. (A good thing too,
>> since I don't think we support ALTER TABLE ADD DEFAULT...)
>
>Hmmm. If it were I, I would have tackled it in a slightly different way:
>
>COPY the data into a temporary table, that doesn't have the id numbers at
>all. Thus you don't have to have a counter on the client side, that knows
>the last id that's already on the table, etc, etc., and you also don't have
>to transfer several extra bytes per row through the postgres port.

Does the volume of data really matter? I was assuming that it was the
indexing that was taking the time. Many of my field sizes are generous
and padded with spaces.

>Then, when you have a temp table, you can add the values to the main table
>with an
>
>INSERT INTO main_table (field1, field2, field3)
>SELECT field1, field2, field3
>FROM temp_table;
>
>If you don't mention the field that carries the default in this INSERT
>statement, it will invoke the default. Dropping the index may still be a
>good idea. The temp table shouldn't have an index anyways.
>
>This would save you at least the three last steps in your "recipe".

Thanks a lot for this, I will try it next.

At the moment the job is going much faster, but is getting stuck on the
index creation. I am turning off fsync (-o -F) during the copy, which
copies all 2000000 records in one COPY. Should I keep it off during the
index creation?

It looks increasingly as though this process is going to take an
unacceptable amount of time and resources on a live server. What is the
recommended way of creating a db on another machine and then hot
swapping the live and the new dbs?

yours
Tim Pizey

Happy to take part: http://www.paneris.co.uk/