Re: [PERFORM] A Better External Sort? - Mailing list pgsql-hackers
From | Ron Peacetree |
---|---|
Subject | Re: [PERFORM] A Better External Sort? |
Date | |
Msg-id | 21402654.1127923414088.JavaMail.root@elwamui-polski.atl.sa.earthlink.net |
In response to | [PERFORM] A Better External Sort? (Ron Peacetree <rjpeace@earthlink.net>) |
Responses | Re: [PERFORM] A Better External Sort? |
List | pgsql-hackers |
>From: "Jeffrey W. Baker" <jwbaker@acm.org> >Sent: Sep 27, 2005 1:26 PM >To: Ron Peacetree <rjpeace@earthlink.net> >Subject: Re: [HACKERS] [PERFORM] A Better External Sort? > >On Tue, 2005-09-27 at 13:15 -0400, Ron Peacetree wrote: > >>That Btree can be used to generate a physical reordering of the data >>in one pass, but that's the weakest use for it. The more powerful >>uses involve allowing the Btree to persist and using it for more >>efficient re-searches or combining it with other such Btrees (either as >>a step in task distribution across multiple CPUs or as a more efficient >>way to do things like joins by manipulating these Btrees rather than >>the actual records.) > >Maybe you could describe some concrete use cases. I can see what >you are getting at, and I can imagine some advantageous uses, but >I'd like to know what you are thinking. > 1= In a 4P box, we split the data in RAM into 4 regions and create a CPU cache friendly Btree using the method I described for each CPU. The 4 Btrees can be merged in a more time and space efficient manner than the original records to form a Btree that represents the sorted order of the entire data set. Any of these Btrees can be allowed to persist to lower the cost of doing similar operations in the future (Updating the Btrees during inserts and deletes is cheaper than updating the original data files and then redoing the same sort from scratch in the future.) Both the original sort and future such sorts are made more efficient than current methods. 2= We use my method to sort two different tables. We now have these very efficient representations of a specific ordering on these tables. A join operation can now be done using these Btrees rather than the original data tables that involves less overhead than many current methods. 3= We have multiple such Btrees for the same data set representing sorts done using different fields (and therefore different Keys). Calculating a sorted order for the data based on a composition of those Keys is now cheaper than doing the sort based on the composite Key from scratch. When some of the Btrees exist and some of them do not, there is a tradeoff calculation to be made. Sometimes it will be cheaper to do the sort from scratch using the composite Key. >Specifically I'd like to see some cases where this would beat sequential >scan. I'm thinking that in your example of a terabyte table with a >column having only two values, all the queries I can think of would be >better served with a sequential scan. > In my original example, a sequential scan of the 1TB of 2KB or 4KB records, => 250M or 500M records of data, being sorted on a binary value key will take ~1000x more time than reading in the ~1GB Btree I described that used a Key+RID (plus node pointers) representation of the data. Just to clarify the point further, 1TB of 1B records => 2^40 records of at most 256 distinct values. 1TB of 2B records => 2^39 records of at most 2^16 distinct values. 1TB of 4B records => 2^38 records of at most 2^32 distinct values. 1TB of 5B records => 200B records of at most 200B distinct values. From here on, the number of possible distinct values is limited by the number of records. 100B records are used in the "Indy" version of Jim Gray's sorting contests, so 1TB => 10B records. 2KB-4KB is the most common record size I've seen in enterprise class DBMS (so I used this value to make my initial example more realistic). Therefore the vast majority of the time representing a data set by Key will use less space that the original record. 
Less space used means less IO to scan the data set, which means faster
scan times. This is why index files work in the first place, right?

>Perhaps I believe this because you can now buy as much sequential I/O
>as you want. Random I/O is the only real savings.
>
1= No, you can not "buy as much sequential IO as you want". Even with an
infinite budget, there are physical and engineering limits, and long
before you reach those limits you will pay exponentially increasing costs
for linearly increasing performance gains. So even if you _can_ buy a
certain level of sequential IO, it may not be the most efficient way to
spend money.

2= Most RW IT professionals have far from an infinite budget. Just the
traffic on these lists shows how severe the typical cost constraints
usually are. OTOH, if you do have an infinite IT budget, care to help a
few less fortunate than yourself? After all, even a large constant
subtracted from infinity is still infinity... ;-)

3= No matter how fast you can do IO, IO remains the most expensive part
of the performance equation. The fastest and cheapest IO you can do is
_no_ IO. As long as we trade cheaper RAM and even cheaper CPU operations
for IO correctly, more space-efficient data representations will always
be a win.
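As a back-of-envelope illustration of that space-vs-IO tradeoff, here is
a tiny C calculation using the 2KB-record case from my example. The
per-entry size of 2 bytes is simply backed out of the ~1GB Btree figure I
quoted for ~500M records, so treat it as an assumption, not a measured
number.

#include <stdio.h>

int
main(void)
{
    const double GB = 1024.0 * 1024.0 * 1024.0;
    const double TB = 1024.0 * GB;

    double record_size = 2048.0;          /* 2KB records => ~500M of them */
    double nrecords    = TB / record_size;

    /* Assumed per-entry cost of the Key+RID representation: ~2B/entry
     * is roughly what a ~1GB structure over ~500M records works out to. */
    double entry_size  = 2.0;

    double full_scan   = nrecords * record_size;   /* the whole 1TB       */
    double index_scan  = nrecords * entry_size;    /* ~1GB                */

    printf("records          : %.0fM\n", nrecords / 1e6);
    printf("full-record scan : %.0f GB\n", full_scan / GB);
    printf("key+RID scan     : %.1f GB\n", index_scan / GB);
    printf("IO ratio         : ~%.0fx\n", full_scan / index_scan);
    return 0;
}

The ratio is just record_size / entry_size, which is where the ~1000x
figure for this case comes from.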