Non-linear Performance - Mailing list pgsql-general
From | Curt Sampson
---|---
Subject | Non-linear Performance
Date |
Msg-id | Pine.NEB.4.43.0205301929110.455-100000@angelic.cynic.net
Responses | Re: Non-linear Performance
List | pgsql-general
I'm noticing that performance in creating a particular index, and also a little bit in a simple query, seems somewhat non-linear, and I'm wondering if anybody could give me any clues as to what might be causing this, and how I might fix it, if that's possible.

I've done a COPY in and index build on three data sets (just integer data) consisting of 10m rows (500 MB table), 100m rows (5 GB table), and 500m rows (25 GB table). (This is all on a 1.4 GHz P4, 512 MB RAM, two 7200 RPM IDE drives, one for data and one for log.)

The COPY command to do the import is pretty linear, at about 230 sec., 2,000 sec. and 10,000 sec. for the 10m, 100m and 500m row tables. Neither disk I/O nor CPU seems to be pushed particularly hard, though I'm getting lots of small writes to the log files. I tried using a larger checkpoint size, but it didn't seem to help any.

Recreating the primary key (on an INT, and of course all values unique) took 123, 1,300 and 6,700 seconds, again pretty linear.

The next column is a DATE, which is the same value for the first 3.5M records, the next day for the next 3.5M records, and so on, incrementing a day for every 3.5M records (in order). This index build took about 60, 700 and 3,500 seconds, respectively, again linear.

But then I get to the next INT column, which in every row is filled in with a random value between 0 and 99,999. This index takes about 175, 3,600 and 28,000 seconds, respectively, to generate. So it takes about 2x as long per record going from 10m to 100m records, and about 1.5x as long again per record when going from 100m to 500m records.

Queries using that index seem to do this too, though not quite as badly. Using a very simple query such as "SELECT COUNT(*) FROM table WHERE value = 12345" (where value is the last INT column above that took ages to index), typical query times (including connection overhead) for data not in the cache are 0.6 sec., 11 sec. and 72 sec.

This query, as expected, is completely dominated by random I/O; the disk the table is on sits there at 100% usage (i.e., disk requests outstanding 100% of the time) and not much else is happening at all. It does seem to do a few more disk transfers than I would really expect, though. I get back a count of around 4000-5000, which to me implies about 5000 reads plus the index reads (which one would think would not amount to more than one or two hundred pages), yet 110 I/O requests per second times 70 seconds implies about 7,000 reads. Is there something I'm missing here?

(If the query plan and analysis is any help, here it is:

Aggregate  (cost=22081.46..22081.46 rows=1 width=0) (actual time=70119.88..70119.88 rows=1 loops=1)
  ->  Index Scan using data_3_value on data_3  (cost=0.00..22067.71 rows=5498 width=0) (actual time=38.70..70090.45 rows=4980 loops=1)
Total runtime: 70121.74 msec)

Anyway, I'm open to any thoughts on this. In particular, I'm open to suggestions for cheap ways of dealing with this horrible random I/O load. (Yeah, yeah, I know: disk array, and SCSI while I'm about it. But I've been asked to get this sort of thing working fast on systems much cheaper than the current Sun/Sybase/EMC or whatever it is that they're using.)

cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC
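For reference, here is a minimal SQL sketch of the schema and statements described in the post. The table name data_3, the column value and the index data_3_value come from the EXPLAIN output; the other column and index names, the exact DDL, and the COPY file path are assumptions made for illustration, since the post only gives the column types.

```sql
-- Sketch of the workload described above. Names marked as assumptions
-- do not appear in the original post.
CREATE TABLE data_3 (
    id    INTEGER NOT NULL,   -- unique INT, later made the primary key (name "id" is an assumption)
    day   DATE    NOT NULL,   -- same value for each run of 3.5M rows (name "day" is an assumption)
    value INTEGER NOT NULL    -- random 0..99,999; the column whose index builds slowly
);

-- Bulk load: the step that took roughly 230 / 2,000 / 10,000 seconds.
COPY data_3 FROM '/path/to/data.copy';   -- path is a placeholder

-- Index builds: roughly linear for id and day, super-linear for value.
ALTER TABLE data_3 ADD CONSTRAINT data_3_pkey PRIMARY KEY (id);
CREATE INDEX data_3_day   ON data_3 (day);    -- index name is an assumption
CREATE INDEX data_3_value ON data_3 (value);  -- index name from the EXPLAIN output

-- The probe query that is dominated by random I/O on the table heap.
EXPLAIN ANALYZE SELECT COUNT(*) FROM data_3 WHERE value = 12345;
```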