Thread: Newbie dbadmin out of his league
Hi all,

I am new to dbadmin, and have to load 2 million records into a db every
month :( So far we have found that a naive approach will result in the
data taking longer to load than it is valid for.

Initially we were using JDBC to insert or update the table. Now we are
only using JDBC to insert new category records, such as when a new
format or country is encountered. The actual INSERT or UPDATE statements
are written out to an SQL file. Creating the SQL file is estimated at
30 hours, but updating the database with:

  psql -f output.sql db

looks as though it will take forever: it has inserted 177778 records in
43 hours. I have just dropped a text index to help things along.

So two questions:

1. The primary key is an autoincrement field called id:

CREATE SEQUENCE parts_id_seq start 1 increment 1 maxvalue 2147483647
  minvalue 1 cache 1;
SELECT nextval('parts_id_seq');
CREATE TABLE "parts" (
  "id" int4 DEFAULT nextval('parts_id_seq') NOT NULL,
  "part_no" character(50) NOT NULL,
  "deleted" bool,
  "updatedate" timestamp
);

The INSERT statements do not give the id explicitly. Would I be better
off dropping the auto-increment and then re-applying it? How do I do
this?

2. How do I use COPY?

Presumably I again have to take control of the key values:
  drop the index
  copy from tab-delimited file containing hard ids
  create id sequence
  create index
  modify id definition

I would be very grateful for any comments, and any longer-term help can
be paid for.

yours
Tim Pizey
http://www.paneris.co.uk/
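(Presumably each generated statement looks roughly like the sketch below; the
values are made up. Since the id column is omitted, every such INSERT fires
the column DEFAULT, and therefore a nextval() call, per row:)

INSERT INTO parts (part_no, deleted, updatedate)
VALUES ('ABC-12345', 'f', '1999-08-27 12:00:00');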
Tim Pizey <tim@paneris.co.uk> writes:
> 2. How do I use COPY?
> Presumably I again have to take control of the key values,
>   drop the index
>   copy from tab delimited file containing hard ids
>   create id sequence
>   create index
>   modify id definition

Yup, that's about what you need to do. You can leave the "DEFAULT"
clause where it is, since it won't be invoked during a COPY that's
supplying non-default values for the ID column. (A good thing too,
since I don't think we support ALTER TABLE ADD DEFAULT...) Just create
the sequence with the right starting value (one past last ID being
loaded). You can do that before or after the COPY, doesn't matter.

A COPY will be way faster than a series of INSERT commands, especially
if each INSERT is invoking a nextval(). nextval() is great for
serializing live updates but it's pretty inefficient for a bulk-loading
situation. Creating the index after the load is reputed to be faster
than building it incrementally, as well.

BTW, I believe pg_dump gets this right, so you could look at the script
generated by pg_dump of a small sample table for details.

			regards, tom lane
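(Spelled out against the parts table from the original post, this recipe
might look roughly like the following. The index name, file path, and the
start value 2000001 are all placeholders; the start value should be one past
the last id actually loaded:)

-- hypothetical index name; use whatever the existing text index is called
DROP INDEX parts_part_no_idx;

-- tab-delimited file that includes the hard id values (path is made up)
COPY "parts" FROM '/tmp/parts.tab';

-- recreate the sequence starting one past the last loaded id
DROP SEQUENCE parts_id_seq;
CREATE SEQUENCE parts_id_seq start 2000001;

-- rebuild the index only after the data is in
CREATE INDEX parts_part_no_idx ON parts (part_no);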
On Fri, Aug 27, 1999 at 05:19:18PM -0400, Tom Lane wrote:
> Tim Pizey <tim@paneris.co.uk> writes:
> > 2. How do I use COPY?
>
> Just create the sequence with the right starting value (one past last
> ID being loaded). You can do that before or after the COPY, doesn't
> matter.

What I always do with sequences that are autocreated for serial types
(and so have a start value of 1), after the COPY puts all my data in
place:

select setval('tablename_field_seq', max(field)) from tablename;

Ross

--
Ross J. Reedstrom, Ph.D., <reedstrm@rice.edu>
NSBRI Research Scientist/Programmer
Computer and Information Technology Institute
Rice University, 6100 S. Main St., Houston, TX 77005
At 00:19 +0300 on 28/08/1999, Tom Lane wrote:

> > Presumably I again have to take control of the key values,
> >   drop the index
> >   copy from tab delimited file containing hard ids
> >   create id sequence
> >   create index
> >   modify id definition
>
> Yup, that's about what you need to do. You can leave the "DEFAULT"
> clause where it is, since it won't be invoked during a COPY that's
> supplying non-default values for the ID column. (A good thing too,
> since I don't think we support ALTER TABLE ADD DEFAULT...)

Hmmm. If it were I, I would have tackled it in a slightly different way:

COPY the data into a temporary table that doesn't have the id numbers at
all. Thus you don't have to have a counter on the client side that knows
the last id that's already in the table, etc., and you also don't have
to transfer several extra bytes per row through the postgres port.

Then, when you have a temp table, you can add the values to the main
table with an

INSERT INTO main_table (field1, field2, field3)
SELECT field1, field2, field3
FROM temp_table;

If you don't mention the field that carries the default in this INSERT
statement, it will invoke the default. Dropping the index may still be a
good idea. The temp table shouldn't have an index anyway.

This would save you at least the three last steps in your "recipe".

Herouth

--
Herouth Maoz, Internet developer.
Open University of Israel - Telem project
http://telem.openu.ac.il/~herutma
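(A rough sketch of the above for the parts table from the original post; the
scratch-table name and file path are made up, and a plain table stands in
for the temp table:)

-- scratch table with no id column and no index
CREATE TABLE parts_load (
  "part_no" character(50),
  "deleted" bool,
  "updatedate" timestamp
);

-- tab-delimited file, no id column
COPY parts_load FROM '/tmp/parts_noid.tab';

-- id is not mentioned, so DEFAULT nextval('parts_id_seq') fires per row
INSERT INTO parts (part_no, deleted, updatedate)
SELECT part_no, deleted, updatedate
FROM parts_load;

DROP TABLE parts_load;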
At 17:11 27/08/99 -0500, Ross J. Reedstrom wrote:
>On Fri, Aug 27, 1999 at 05:19:18PM -0400, Tom Lane wrote:
>> Tim Pizey <tim@paneris.co.uk> writes:
>> > 2. How do I use COPY?
>>
>> Just create the sequence with the right starting value (one past last
>> ID being loaded). You can do that before or after the COPY, doesn't
>> matter.
>
>What I always do with sequences that are autocreated for serial types
>(and so have a start value of 1), after the COPY puts all my data in place:
>
>select setval('tablename_field_seq',max(field)) from tablename;

Thanks Ross, I tried this but the max() function takes as long as
anything else as far as I can see.

yours
Tim Pizey

Happy to take part: http://www.paneris.co.uk/
At 18:45 01/09/99 +0300, Herouth Maoz wrote:
>At 00:19 +0300 on 28/08/1999, Tom Lane wrote:
>
>> > Presumably I again have to take control of the key values,
>> >   drop the index
>> >   copy from tab delimited file containing hard ids
>> >   create id sequence
>> >   create index
>> >   modify id definition
>>
>> Yup, that's about what you need to do. You can leave the "DEFAULT"
>> clause where it is, since it won't be invoked during a COPY that's
>> supplying non-default values for the ID column. (A good thing too,
>> since I don't think we support ALTER TABLE ADD DEFAULT...)
>
>Hmmm. If it were I, I would have tackled it in a slightly different way:
>
>COPY the data into a temporary table, that doesn't have the id numbers at
>all. Thus you don't have to have a counter on the client side, that knows
>the last id that's already on the table, etc, etc., and you also don't have
>to transfer several extra bytes per row through the postgres port.

Does the volume of data really matter? I was assuming that it was the
indexing that was taking the time. Many of my field sizes are generous
and padded with spaces.

>Then, when you have a temp table, you can add the values to the main table
>with an
>
>INSERT INTO main_table (field1, field2, field3)
>SELECT field1, field2, field3
>FROM temp_table;
>
>If you don't mention the field that carries the default in this INSERT
>statement, it will invoke the default. Dropping the index may still be a
>good idea. The temp table shouldn't have an index anyways.
>
>This would save you at least the three last steps in your "recipe".

Thanks a lot for this, I will try it next.

At the moment the job is going much faster, but is getting stuck on the
index creation. I am turning off fsync (-o -F) during the copy, which
copies all 2000000 records in one COPY. Should I keep it off during the
index creation?

It looks increasingly as though this process is going to take an
unacceptable amount of time and resources on a live server. What is the
recommended way of creating a db on another machine and then hot
swapping the live and the new dbs?

yours
Tim Pizey

Happy to take part: http://www.paneris.co.uk/