Thread: SQL command speed
I am running PostgreSQL 7.0 on Red Hat Linux 6.2. I am fairly new to using PostgreSQL. I am in the process of comparing performance with an Oracle database. I have converted and populated several tables from Oracle to PostgreSQL. To test the speed of the databases, I wrote the following script using the Perl DBI:

--- Start Script 1 ---

#!/bin/perl

use DBI;

$dbh = DBI->connect("dbi:Pg:dbname=pbi", "ntm", "",
                    { RaiseError => 1, AutoCommit => 0 })
    or die "Can't connect to PGSQL";

$sql = "SELECT notam_id, TO_CHAR(full_issue_date, 'DD-MON-YYYY HH24:MI')
        FROM notam_details
        WHERE
          item_a = 'EGKB' OR item_a = 'EGDM' OR item_a = 'EGHH' OR
          item_a = 'EGGD' OR item_a = 'EGVN' OR item_a = 'EGFF' OR
          item_a = 'EGDC' OR item_a = 'EGTC' OR item_a = 'EGDR' OR
          item_a = 'EGTE' OR item_a = 'EGLF' OR item_a = 'EGTG' OR
          item_a = 'EGBJ' OR item_a = 'EGLC' OR item_a = 'EGKK' OR
          item_a = 'EGLL' OR item_a = 'EGSS' OR item_a = 'EGGW' OR
          item_a = 'EGMD' OR item_a = 'EGDL' OR item_a = 'EGUM' OR
          item_a = 'EGHD' OR item_a = 'EGHE' OR item_a = 'EGKA' OR
          item_a = 'EGHI' OR item_a = 'EGMC' OR item_a = 'EGDG' OR
          item_a = 'EGFH' OR item_a = 'EGDY' OR item_a = 'EGJA' OR
          item_a = 'EGJB' OR item_a = 'EGJJ'";

$sth = $dbh->prepare($sql);
$sth->execute();
$result = $sth->fetchall_arrayref();

foreach (@{$result}) {
    ($id, $date) = @$_;
    print "$id:$date\n";
}

--- End Script 1 ---

When I ran it, it took 12 seconds with PostgreSQL and 1 second with Oracle. I then went through several steps using vacuum, building indexes, etc., but I did not see much of a performance improvement. Then I used EXPLAIN and noticed that the query was not using the indexes I created. I did some experimentation: if the WHERE clause had one or two items it would use the index; with more it would not.

So I rewrote the script to do multiple small SELECT queries instead of one big SELECT query. The new script looked like this:

--- Start Script 2 ---

#!/bin/perl

use DBI;

$dbh = DBI->connect("dbi:Pg:dbname=pbi", "ntm", "",
                    { RaiseError => 1, AutoCommit => 0 })
    or die "Can't connect to PGSQL";

$sql = "SELECT notam_id, TO_CHAR(full_issue_date, 'DD-MON-YYYY HH24:MI')
        FROM notam_details
        WHERE item_a = ?";

$sth = $dbh->prepare($sql);

@stations = qw(EGKB EGDM EGHH EGGD EGVN EGFF EGDC EGTC EGDR EGTE
               EGLF EGTG EGBJ EGLC EGKK EGLL EGSS EGGW EGMD EGDL
               EGUM EGHD EGHE EGKA EGHI EGMC EGDG EGFH EGDY EGJA
               EGJB EGJJ);

foreach (@stations) {
    $sth->bind_param(1, $_);
    $sth->execute();
    $result = $sth->fetchall_arrayref();
    foreach $s (@{$result}) {
        ($id, $date) = @$s;
        print "$id:$date\n";
    }
}

--- End Script 2 ---

The result was that the execution time of the script dropped to 1 second using PostgreSQL! At first I thought it was a feature of the Perl DBI, but I ran the same queries using psql and got similar results. I also ran some timing checks of the Perl code, and 99% of the execution time of Script 1 is being spent in the "execute" statement.

I have looked through the PostgreSQL documentation and can find no explanation for this. Is this a known issue, and is this the proper way to write SELECT statements in PostgreSQL?

Kate Collins

BTW, Script 2 takes a little longer to run on the Oracle system, circa 1.3 seconds.

--
=================================================
Katherine (Kate) L. Collins
Senior Software Engineer/Meteorologist
Weather Services International (WSI Corporation)
4 Federal Street
Billerica, MA 01821
EMAIL: klcollins@wsicorp.com
PHONE: (978) 670-5110
FAX: (978) 670-5100
http://www.intellicast.com
I believe the PostgreSQL optimizer is fooled by many ORs and switches to a sequential scan: it cannot properly estimate how many rows will be returned.

Try it in one go, as:

$sql = "SELECT notam_id, TO_CHAR(full_issue_date, 'DD-MON-YYYY HH24:MI')
        FROM notam_details
        WHERE item_a IN
          ('EGKB','EGDM','EGHH','EGGD','EGVN','EGFF',
           'EGDC','EGTC','EGDR','EGTE','EGLF','EGTG',
           'EGBJ','EGLC','EGKK','EGLL','EGSS','EGGW',
           'EGMD','EGDL','EGUM','EGHD','EGHE','EGKA',
           'EGHI','EGMC','EGDG','EGFH','EGDY','EGJA',
           'EGJB','EGJJ')";

The optimizer should (I think) like this better and use the indices, without requiring that you iterate the queries from the frontend.

As I am trying to learn these things too, I will appreciate knowing about the result of your tests: please keep me posted.

Thanks

Miguel Sofer
Miguel,

Thank you for the reply. I created a third script with the syntax you suggested, and the result was similar to the first script: about 11.5 to 12 seconds.

The interesting thing is that in all three cases, when I use the Unix "time" command to time the execution, the "user" and "sys" portions of the execution are about the same: user = 0.20, sys = 0.02. This indicates to me that the bottleneck is not in the execution of the Perl code, but maybe in the postmaster daemon.

The system I am testing with is a PII, 400 MHz with 256 MB of RAM. It is not doing anything else at this time but running these tests.

Interestingly enough, under Oracle the new script takes about the same time as the other two, 1-1.5 seconds.

Kate Collins

mig@utdt.edu wrote:

> I believe the PostgreSQL optimizer is fooled by many ORs and switches
> to a sequential scan: it cannot properly estimate how many rows will
> be returned.
>
> Try it in one go, as:
>
> $sql = "SELECT notam_id, TO_CHAR(full_issue_date, 'DD-MON-YYYY HH24:MI')
>         FROM notam_details
>         WHERE item_a IN
>           ('EGKB','EGDM','EGHH','EGGD','EGVN','EGFF',
>            'EGDC','EGTC','EGDR','EGTE','EGLF','EGTG',
>            'EGBJ','EGLC','EGKK','EGLL','EGSS','EGGW',
>            'EGMD','EGDL','EGUM','EGHD','EGHE','EGKA',
>            'EGHI','EGMC','EGDG','EGFH','EGDY','EGJA',
>            'EGJB','EGJJ')";
>
> The optimizer should (I think) like this better and use the indices,
> without requiring that you iterate the queries from the frontend.
>
> As I am trying to learn these things too, I will appreciate knowing
> about the result of your tests: please keep me posted.
>
> Thanks
>
> Miguel Sofer

--
=================================================
Katherine (Kate) L. Collins
Senior Software Engineer/Meteorologist
Weather Services International (WSI Corporation)
4 Federal Street
Billerica, MA 01821
EMAIL: klcollins@wsicorp.com
PHONE: (978) 670-5110
FAX: (978) 670-5100
http://www.intellicast.com
Thanks for your reply; I am disappointed that it did not work the way I thought it would ...

I suppose that you did VACUUM ANALYZE your table recently, right? If not, the stale table statistics may be playing a role too: do it, and try again.

With this I have reached the outer boundaries of my "knowledge"; if it doesn't work out, I hope somebody else may be able to help. In any case, I would appreciate knowing more about your tests and the eventual outcome ...

Miguel
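For reference, a minimal psql sketch of what Miguel suggests, run against the notam_details table from the earlier scripts (the three-station IN list here simply stands in for the full list):

-- refresh the planner statistics for the table
VACUUM ANALYZE notam_details;

-- then see whether the plan changes
EXPLAIN
SELECT notam_id, TO_CHAR(full_issue_date, 'DD-MON-YYYY HH24:MI')
  FROM notam_details
 WHERE item_a IN ('EGKB', 'EGDM', 'EGHH');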
Hey crew:

I've got a relatively simple SQL problem. In a db-backed web site we're building, I'd like to fill a dropdown box with the contents of a validation table, in this case research institutions. I want to sort them alphabetically, but using "library rules": i.e. skip initial articles, since we've a few 'The University of Foo' and 'The Johns Hopkins University', for example.

I thought I had it with this SQL:

SELECT InstName from Institutions ORDER BY ltrim(InstName, 'The');

Looked good, until I found 'Texas A&M University' sorting below York. Seems ltrim() removes initial characters from the given set of characters, not an initial string, so I was sorting on 'xas A&M University'.

Anyone have some magic solution for this?

Ross
--
Ross J. Reedstrom, Ph.D., <reedstrm@rice.edu>
NSBRI Research Scientist/Programmer
Computer and Information Technology Institute
Rice University, 6100 S. Main St., Houston, TX 77005
Miguel,

I have re-run VACUUM ANALYZE. Unfortunately, it did not result in a speed-up.

Thank you for your suggestions and input. They have helped increase my understanding of how it works. I will wait and see if anybody else can provide any suggestions or insight.

Kate

mig@utdt.edu wrote:

> Thanks for your reply; I am disappointed that it did not work the way
> I thought it would ...
>
> I suppose that you did VACUUM ANALYZE your table recently, right? If
> not, the stale table statistics may be playing a role too: do it, and
> try again.
>
> With this I have reached the outer boundaries of my "knowledge"; if it
> doesn't work out, I hope somebody else may be able to help. In any
> case, I would appreciate knowing more about your tests and the
> eventual outcome ...
>
> Miguel

--
=================================================
Katherine (Kate) L. Collins
Senior Software Engineer/Meteorologist
Weather Services International (WSI Corporation)
4 Federal Street
Billerica, MA 01821
EMAIL: klcollins@wsicorp.com
PHONE: (978) 670-5110
FAX: (978) 670-5100
http://www.intellicast.com
Kate Collins <klcollins@wsicorp.com> writes:
> I did some experimentation, and if the WHERE clause had one or two items
> it would use the index; more and it would not.

Kate, it is reasonable behavior for the planner to stop using indexscans when there are enough OR clauses. Each OR clause requires a separate indexscan, and so eventually it'll be cheaper to just do one sequential scan over the whole table. What we appear to have here is a case of misestimation of the relative costs of index and sequential scans, leading the planner to switch too soon.

Next question is why the misestimation. It's difficult to say anything without seeing your EXPLAIN results for different numbers of OR clauses. Also, how big is the table (how many rows), and how many rows do you actually get from the query?

			regards, tom lane
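One way to gather the numbers Tom asks for from psql (a sketch; the table, column, and station codes are the ones from Kate's Script 1, with the OR list abbreviated):

-- total table size
SELECT count(*) FROM notam_details;

-- actual number of rows the query returns
SELECT count(*)
  FROM notam_details
 WHERE item_a = 'EGKB' OR item_a = 'EGDM' OR item_a = 'EGHH';  -- abbreviated OR list

-- planner estimates as the number of OR'd keys grows
EXPLAIN SELECT notam_id FROM notam_details WHERE item_a = 'EGKB';
EXPLAIN SELECT notam_id FROM notam_details WHERE item_a = 'EGKB' OR item_a = 'EGDM';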
Sorry about the subject: I meant to change it on the last one.

As usual, 5 minutes after posting, I came up with a solution, at least for my limited case of ignoring an initial 'The ':

SELECT InstName
  FROM Institutions
 ORDER BY CASE WHEN strpos(InstName, 'The ') = 1
               THEN ltrim(InstName, 'The ')
               ELSE InstName
          END;

Note that this requires 7.0, since 6.X won't allow a CASE node in the ORDER BY position.

Ross

On Thu, May 18, 2000 at 03:39:09PM -0500, Ross J. Reedstrom wrote:
> Hey crew:
> I've got a relatively simple SQL problem. In a db-backed web site
> we're building, I'd like to fill a dropdown box with the contents of a
> validation table, in this case research institutions. I want to sort them
> alphabetically, but using "library rules": i.e. skip initial articles,
> since we've a few 'The University of Foo' and 'The Johns Hopkins
> University', for example.
>
> I thought I had it with this SQL:
>
> SELECT InstName from Institutions ORDER BY ltrim(InstName, 'The');
>
> Looked good, until I found 'Texas A&M University' sorting below York.
>
> Seems ltrim() removes initial characters from the given set of characters,
> not an initial string, so I was sorting on 'xas A&M University'.
>
> Anyone have some magic solution for this?
>
> Ross
> --
> Ross J. Reedstrom, Ph.D., <reedstrm@rice.edu>
> NSBRI Research Scientist/Programmer
> Computer and Information Technology Institute
> Rice University, 6100 S. Main St., Houston, TX 77005
>
I didn't see a function that would do what you wanted to do, but I guess you could do the trimming in a plpgsql function. (Since I don't have a 7.0 system to test with, these were only tested on an old 6.5.1 database.)

create function trimthe(text) returns text as '
begin
    if substr($1, 1, 4) = \'The \' then
        return substr($1, 5);
    end if;
    -- add other prefix checks here...
    return $1;
end;
' language 'plpgsql';

and then select ordering by that. I guess a more general function could be:

create function ltrimstr(text, text) returns text as '
begin
    if position($2 in $1) = 1 then
        return substr($1, char_length($2)+1);
    end if;
    return $1;
end;
' language 'plpgsql';

----- Original Message -----
From: "Ross J. Reedstrom" <reedstrm@wallace.ece.rice.edu>
To: <pgsql-sql@postgresql.org>
Sent: Thursday, May 18, 2000 1:39 PM
Subject: Re: [SQL] SQL command speed

> Hey crew:
> I've got a relatively simple SQL problem. In a db-backed web site
> we're building, I'd like to fill a dropdown box with the contents of a
> validation table, in this case research institutions. I want to sort them
> alphabetically, but using "library rules": i.e. skip initial articles,
> since we've a few 'The University of Foo' and 'The Johns Hopkins
> University', for example.
>
> I thought I had it with this SQL:
>
> SELECT InstName from Institutions ORDER BY ltrim(InstName, 'The');
>
> Looked good, until I found 'Texas A&M University' sorting below York.
>
> Seems ltrim() removes initial characters from the given set of characters,
> not an initial string, so I was sorting on 'xas A&M University'.
>
> Anyone have some magic solution for this?
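If the more general function is created, the ordering query might then read as the sketch below (untested; assumes the ltrimstr function defined above):

SELECT InstName
  FROM Institutions
 ORDER BY ltrimstr(InstName, 'The ');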
Tom,

Thank you for your reply.

The table I am using has 114600 total rows. The full query returns 1129 rows. Right now the table is static, i.e. I am not modifying it while I am running these tests.

Here are the results of the EXPLAIN with the different numbers of OR's.

---QUERY 1, returns 1129 rows---

pbi=> explain
pbi-> SELECT notam_id, TO_CHAR(full_issue_date, 'DD-MON-YYYY HH24:MI')
pbi-> FROM notam_details
pbi-> WHERE
pbi-> item_a = 'EGKB' OR item_a = 'EGDM' OR item_a = 'EGHH' OR
pbi-> item_a = 'EGGD' OR item_a = 'EGVN' OR item_a = 'EGFF' OR
pbi-> item_a = 'EGDC' OR item_a = 'EGTC' OR item_a = 'EGDR' OR
pbi-> item_a = 'EGTE' OR item_a = 'EGLF' OR item_a = 'EGTG' OR
pbi-> item_a = 'EGBJ' OR item_a = 'EGLC' OR item_a = 'EGKK' OR
pbi-> item_a = 'EGLL' OR item_a = 'EGSS' OR item_a = 'EGGW' OR
pbi-> item_a = 'EGMD' OR item_a = 'EGDL' OR item_a = 'EGUM' OR
pbi-> item_a = 'EGHD' OR item_a = 'EGHE' OR item_a = 'EGKA' OR
pbi-> item_a = 'EGHI' OR item_a = 'EGMC' OR item_a = 'EGDG' OR
pbi-> item_a = 'EGFH' OR item_a = 'EGDY' OR item_a = 'EGJA' OR
pbi-> item_a = 'EGJB' OR item_a = 'EGJJ';
NOTICE:  QUERY PLAN:

Seq Scan on notam_details  (cost=0.00..13420.40 rows=26230 width=12)

EXPLAIN

--- QUERY 2, returns 11 rows ---

pbi=> explain
pbi-> SELECT notam_id, TO_CHAR(full_issue_date, 'DD-MON-YYYY HH24:MI')
pbi-> FROM notam_details
pbi-> WHERE
pbi-> item_a = 'EGKB';
NOTICE:  QUERY PLAN:

Index Scan using notam_details_item_a on notam_details  (cost=0.00..2739.57 rows=927 width=12)

EXPLAIN

--- QUERY 3, returns 11 rows ---

pbi=> explain
pbi-> SELECT notam_id, TO_CHAR(full_issue_date, 'DD-MON-YYYY HH24:MI')
pbi-> FROM notam_details
pbi-> WHERE
pbi-> item_a = 'EGKB' OR item_a = 'EGDM';
NOTICE:  QUERY PLAN:

Seq Scan on notam_details  (cost=0.00..4820.90 rows=1847 width=12)

EXPLAIN

---

I have run VACUUM ANALYZE NOTAM_DETAILS and created an index on the item_a column.

I have only been using PostgreSQL for about a week, so all of this is pretty new to me. I don't 100% understand how all of this works yet, so any insight you can provide will be appreciated.

Kate Collins

Tom Lane wrote:

> Kate Collins <klcollins@wsicorp.com> writes:
> > I did some experimentation, and if the WHERE clause had one or two items
> > it would use the index; more and it would not.
>
> Kate, it is reasonable behavior for the planner to stop using
> indexscans when there are enough OR clauses.  Each OR clause requires
> a separate indexscan and so eventually it'll be cheaper to just do one
> sequential scan over the whole table.  What we appear to have here is
> a case of misestimation of the relative costs of index and sequential
> scans, leading the planner to switch too soon.  Next question is why
> the misestimation.  It's difficult to say anything without seeing
> your EXPLAIN results for different numbers of OR clauses.  Also, how
> big is the table (how many rows) and how many rows do you actually
> get from the query?
>
>                         regards, tom lane

--
=================================================
Katherine (Kate) L. Collins
Senior Software Engineer/Meteorologist
Weather Services International (WSI Corporation)
4 Federal Street
Billerica, MA 01821
EMAIL: klcollins@wsicorp.com
PHONE: (978) 670-5110
FAX: (978) 670-5100
http://www.intellicast.com
Kate Collins <klcollins@wsicorp.com> writes:
> The table I am using has 114600 total rows.  The full query returns 1129
> rows.  Right now the table is static, i.e. I am not modifying it while I am
> running these tests.
>
> Here are the results of the EXPLAIN with the different numbers of OR's.
>
> ---QUERY 1, returns 1129 rows---
> pbi=> [ 32-or variant ]
> Seq Scan on notam_details  (cost=0.00..13420.40 rows=26230 width=12)
>
> --- QUERY 2, returns 11 rows ---
> pbi=> [ just one WHERE clause ]
> Index Scan using notam_details_item_a on notam_details  (cost=0.00..2739.57 rows=927 width=12)
>
> --- QUERY 3, returns 11 rows ---
> pbi=> [ 2 ORed clauses ]
> Seq Scan on notam_details  (cost=0.00..4820.90 rows=1847 width=12)

OK, the problem here is the disconnect between the planner's estimate of the number of returned rows (the "rows" number in EXPLAIN) and the actual result row count. If the query actually did need to pull nearly a thousand rows for each OR'd key, then using a seqscan for more than a couple of ORs would make sense. But in reality you don't have nearly that many rows per OR key, so an indexscan is needed.

That estimate is largely driven by a single statistic collected by VACUUM ANALYZE, which is the frequency of the most common value in the item_a column. If the MCV is pretty common then the planner assumes that the column doesn't have very many distinct values, so you get a high estimate of the number of returned rows. Evidently you have a very common MCV for item_a, but the key values you are looking for aren't nearly that common.

I have seen a number of cases where someone was using a dummy value (like 'N/A', or an empty string, etc) to indicate unknown data, and there were so many of these entries as to not merely be the MCV, but drive the MCV's frequency statistic far above the frequency of occurrence of any "real" value. This fools the planner into thinking that *all* the values are like that, and so it generates plans accordingly.

The long-run solution is to gather more-detailed statistics, and that's in the TODO list for a future version. In the meantime, there is a workaround that you may be able to use: instead of a dummy value, store NULL for unknown entries. The statistics gatherer already accounts for NULLs separately, so a large fraction of NULLs won't fool the planner about the frequency of non-null values.

Note: if you try this, be sure to re-run VACUUM ANALYZE after you replace the dummies with NULLs. The plans won't change until there's a more accurate statistical entry for your table...

			regards, tom lane
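A minimal sketch of the workaround Tom describes, assuming (purely hypothetically) that the dummy entries in item_a are stored as the string 'N/A':

-- 'N/A' is a hypothetical placeholder; substitute whatever dummy value the table really uses
UPDATE notam_details SET item_a = NULL WHERE item_a = 'N/A';

-- rebuild the statistics so the planner sees the new NULL fraction
VACUUM ANALYZE notam_details;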
Tom,

Thank you for the explanation. You are correct in your assessment of the nature of the distribution of the elements in the item_a column. Some values return no rows or only a few, while other values return many (100 or more). It is not an even distribution. I had never really considered what effect this would have on searching before.

Kate

Tom Lane wrote:

> Kate Collins <klcollins@wsicorp.com> writes:
> > The table I am using has 114600 total rows.  The full query returns 1129
> > rows.  Right now the table is static, i.e. I am not modifying it while I am
> > running these tests.
> >
> > Here are the results of the EXPLAIN with the different numbers of OR's.
> >
> > ---QUERY 1, returns 1129 rows---
> > pbi=> [ 32-or variant ]
> > Seq Scan on notam_details  (cost=0.00..13420.40 rows=26230 width=12)
> >
> > --- QUERY 2, returns 11 rows ---
> > pbi=> [ just one WHERE clause ]
> > Index Scan using notam_details_item_a on notam_details  (cost=0.00..2739.57 rows=927 width=12)
> >
> > --- QUERY 3, returns 11 rows ---
> > pbi=> [ 2 ORed clauses ]
> > Seq Scan on notam_details  (cost=0.00..4820.90 rows=1847 width=12)
>
> OK, the problem here is the disconnect between the planner's estimate
> of the number of returned rows (the "rows" number in EXPLAIN) and the
> actual result row count.  If the query actually did need to pull nearly
> a thousand rows for each OR'd key, then using a seqscan for more than a
> couple of ORs would make sense.  But in reality you don't have nearly
> that many rows per OR key, so an indexscan is needed.
>
> That estimate is largely driven by a single statistic collected by
> VACUUM ANALYZE, which is the frequency of the most common value in
> the item_a column.  If the MCV is pretty common then the planner assumes
> that the column doesn't have very many distinct values, so you get a
> high estimate of the number of returned rows.  Evidently you have
> a very common MCV for item_a, but the key values you are looking for
> aren't nearly that common.
>
> I have seen a number of cases where someone was using a dummy value
> (like 'N/A', or an empty string, etc) to indicate unknown data, and
> there were so many of these entries as to not merely be the MCV,
> but drive the MCV's frequency statistic far above the frequency of
> occurrence of any "real" value.  This fools the planner into thinking
> that *all* the values are like that, and so it generates plans
> accordingly.  The long-run solution is to gather more-detailed
> statistics, and that's in the TODO list for a future version.  In the
> meantime, there is a workaround that you may be able to use: instead
> of a dummy value, store NULL for unknown entries.  The statistics
> gatherer already accounts for NULLs separately, so a large fraction
> of NULLs won't fool the planner about the frequency of non-null values.
>
> Note: if you try this, be sure to re-run VACUUM ANALYZE after you
> replace the dummies with NULLs.  The plans won't change until there's
> a more accurate statistical entry for your table...
>
>                         regards, tom lane

--
=================================================
Katherine (Kate) L. Collins
Senior Software Engineer/Meteorologist
Weather Services International (WSI Corporation)
4 Federal Street
Billerica, MA 01821
EMAIL: klcollins@wsicorp.com
PHONE: (978) 670-5110
FAX: (978) 670-5100
http://www.intellicast.com
Kate Collins <klcollins@wsicorp.com> writes:
> You are correct in your assessment of the nature of the distribution
> of the elements in the item_a column.  Some values return no rows or
> only a few, while other values return many (100 or more).  It is not
> an even distribution.  I had never really considered what effect this
> would have on searching before.

Apparently you have at least one value with close to 2000 rows (IIRC, the rows estimate is basically half the MCV's row count, except in the special case where the planner can see that you are comparing against the MCV itself). If that's a real data value and not a dummy, then of course you can't replace it by NULL.

In that case I'm kind of stuck for a good answer for 7.0; we can't do much to distinguish the few-distinct-values case from the many-distinct-values-with-some-frequency-outliers case until we have better statistics than 7.0 keeps.

What you could do as a stopgap is to force the system to use an indexscan against its judgment, by doing this before issuing the problem query:

	SET enable_seqscan = OFF;

Naturally I recommend keeping this ON most of the time, but sometimes you just have to get out the blunt instruments ;-)

			regards, tom lane
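Applied from psql, Tom's stopgap might look like the sketch below: turn the setting off, run the problem query, then restore the default so later queries keep the normal planner behavior (the WHERE clause is abbreviated here; the real query uses the full station list from Script 1):

SET enable_seqscan = OFF;

SELECT notam_id, TO_CHAR(full_issue_date, 'DD-MON-YYYY HH24:MI')
  FROM notam_details
 WHERE item_a = 'EGKB' OR item_a = 'EGDM' OR item_a = 'EGHH';  -- abbreviated OR list

SET enable_seqscan = ON;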