Re: No merge sort? - Mailing list pgsql-hackers
From | Ron Peacetree |
---|---|
Subject | Re: No merge sort? |
Date | |
Msg-id | EHIla.19833$ey1.1702921@newsread1.prod.itd.earthlink.net Whole thread Raw |
In response to | Re: No merge sort? ("Dann Corbit" <DCorbit@connx.com>) |
Responses |
Re: No merge sort?
|
List | pgsql-hackers |
""Dann Corbit"" <DCorbit@connx.com> wrote in message news:D90A5A6C612A39408103E6ECDD77B829408ACB@voyager.corporate.connx.co m... > > From: Ron Peacetree [mailto:rjpeace@earthlink.net]=20 > > Sent: Wednesday, April 09, 2003 12:38 PM > > You only need as many bins as you have unique key values=20 > > silly ;-) Remember, at its core Radix sort is just a=20 > > distribution counting sort (that's the name I learned for the=20 > > general technique). The simplest implementation uses bits as=20 > > the atomic unit, but there's nothing saying you have to...=20=20 > > In a DB, we know all the values of the fields we currently=20 > > have stored in the DB. We can take advantage of that. > > By what magic do we know this? If a database knew a-priori what all the > distinct values were, it would indeed be excellent magic. For any table already in the DB, this is self evident. If you are talking about sorting data =before= it gets put into the DB (say for a load), then things are different, and the best you can do is used comparison based methods (and probably some reasonably sophisticated external merging routines in addition if the data set to be sorted and loaded is big enough). The original question as I understood it was about a sort as part of a query. That means everything to be sorted is in the DB, and we can take advantage of what we know. > With 80 bytes, you have 2^320 possible values. There is no way > around that. If you are going to count them or use them as a > radix, you will have to classify them. The only way you will > know how many unique values you have in > "Company+Division" is to ... > Either sort them or by some means discover all that are distinct Ummm, Indexes, particularly Primary Key Indexes, anyone? Finding the unique values in an index should be perceived as trivial... ...and you often have the index in memory for other reasons already. > . The database does not know how many there are > beforehand. Indeed, there could be anywhere > from zero to 2^320 (given enough space) distinct values. > > I would be interested to see your algorithm that > performs a counting or radix sort on 320 bit bins and that > works in the general case without using extra space. > 1= See below. I clearly stated that we need to use extra space 2= If your definition of "general" is "we know nothing about the data", then of course any method based on using more sophisticated ordering operators than comparisons is severely hampered, if not doomed. This is not a comparison based method. You have to know some things about the data being sorted before you can use it. I've been =very= clear about that throughout this. > > Note TANSTAAFL (as Heinlein would've said): We have to use=20 > > extra space for mapping the radix values to the radix keys,=20 > > and our "inner loop" for the sorting operation is=20 > > considerably more complicated than that for a quicksort or a=20 > > mergesort. Hence the fact that even though this is an O(n)=20 > > technique, in real world terms you can expect only a 2-3X=20 > > performance improvement over say quicksort for realistic=20 > > amounts of data. > >=20 > > Oh and FTR, IME for most "interesting" sorts in the DB=20 > > domain, even for "internal" sorting techniques, the time to=20 > > read data from disk and > > (possibly) write results back to disk tends to dominate the=20 > > time spent actually doing the internal sort... > >=20 Since you don't believe me, and have implied you wouldn't believe me even if I posted results of efforts on my part, go sort some 2GB files by implementing the algorithms in the sources I've given and as mentioned in this thread. (I'm assuming that you have access to "real" machines that can perform this as an internal sort, or we wouldn't be bothering to have this discussion). Then come back to the table if you still think there are open issues to be discussed.
pgsql-hackers by date: