Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex - Mailing list pgsql-hackers
From: Tom Lane
Subject: Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex
Msg-id: 25859.1248117925@sss.pgh.pa.us
In response to: Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex (Stefan Kaltenbrunner <stefan@kaltenbrunner.cc>)
Responses:
  Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex
  Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex
List: pgsql-hackers
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> Tom Lane wrote:
>> and it turns out that Intel hasn't seen fit to put a lot of effort into
>> the BSR instruction.  It's constant time, all right, but on most of
>> their CPUs that constant time is like 8 or 16 times slower than an ADD;
>> cf http://www.intel.com/Assets/PDF/manual/248966.pdf

> hmm interesting - I don't have the exact numbers any more but that
> patch (or a previous version of it) definitely showed a noticeable
> improvement when I tested with sysbench on a current-generation Intel
> Nehalem...

Hmm.  I may be overestimating the importance of the smaller size
categories.  To get some trustworthy numbers, I made a quick-hack patch
(attachment #1) to count the actual number of calls to AllocSetFreeIndex,
and measured the totals for a run of the regression tests on both a
32-bit machine and a 64-bit machine.

On 32 bits I got these totals:

 0   5190113
 1   5663980
 2   3573261
 3   4476398
 4   4246178
 5   1100634
 6    386501
 7    601654
 8     44884
 9     52372
10    202801

and on 64 bits these:

 0   2139534
 1   5994692
 2   5711479
 3   3289034
 4   4550931
 5   2573389
 6    487566
 7    588470
 8    155148
 9     52750
10    202597

If you want to collect the same numbers for some other workload, feel
free.  I wouldn't trust the results from a single-purpose benchmark too
much, though.

I then put together a test harness that exercises AllocSetFreeIndex
according to these distributions (attachments #2 and #3).  (Note: I found
it necessary to split the test into two files --- otherwise gcc will
inline AllocSetFreeIndex and partially const-fold the work, leading to
skewed results.)

What I'm seeing with this harness on my x86 machines is that
__builtin_clz is indeed a bit faster than a naive loop, but not by very
much --- it saves maybe 25% of the runtime.  It does better on an old
PPC Mac, saving about 50%.  Still, these are not impressive numbers for
a microbenchmark that exercises *only* AllocSetFreeIndex.
I'm still interested in the idea of doing a manual unroll instead of
relying on a compiler-specific feature.  However, some quick testing
didn't find an unrolling that helps much.

			regards, tom lane

*** src/backend/storage/ipc/ipc.c.orig	Thu Jun 11 10:55:12 2009
--- src/backend/storage/ipc/ipc.c	Mon Jul 20 13:56:06 2009
***************
*** 183,188 ****
--- 183,190 ----
  				on_proc_exit_list[on_proc_exit_index].arg);
  
  	on_proc_exit_index = 0;
+ 
+ 	AllocSetPrintStats();
  }
  
  /* ------------------
*** src/backend/utils/mmgr/aset.c.orig	Thu Jun 11 10:55:25 2009
--- src/backend/utils/mmgr/aset.c	Mon Jul 20 13:56:19 2009
***************
*** 255,260 ****
--- 255,271 ----
  #define AllocAllocInfo(_cxt, _chunk)
  #endif
  
+ static unsigned long allocsizes[ALLOCSET_NUM_FREELISTS];
+ 
+ void
+ AllocSetPrintStats()
+ {
+ 	int			i;
+ 
+ 	for (i = 0; i < ALLOCSET_NUM_FREELISTS; i++)
+ 		fprintf(stderr, "category %2d count %lu\n", i, allocsizes[i]);
+ }
+ 
  /* ----------
   * AllocSetFreeIndex -
   *
***************
*** 277,282 ****
--- 288,294 ----
  			size >>= 1;
  		}
  		Assert(idx < ALLOCSET_NUM_FREELISTS);
+ 		allocsizes[idx]++;
  	}
  
  	return idx;

/*
 * usage: time ./testit N
 * N is a repeat count, set it large enough to get repeatable timings
 */

#include <stdio.h>
#include <stdlib.h>

#define ALLOC_MINBITS 3			/* smallest chunk size is 8 bytes */
#define ALLOCSET_NUM_FREELISTS 11

/* number of calls to AllocSetFreeIndex with each result category */
static const int numoccurs[ALLOCSET_NUM_FREELISTS] = {
#ifdef USE_64BIT_COUNTS
	2139534, 5994692, 5711479, 3289034, 4550931, 2573389,
	487566, 588470, 155148, 52750, 202597
#else
	5190113, 5663980, 3573261, 4476398, 4246178, 1100634,
	386501, 601654, 44884, 52372, 202801
#endif
};

extern int	AllocSetFreeIndex(size_t size);

int
main(int argc, char **argv)
{
	int			repeat = atoi(argv[1]);

	while (repeat-- > 0)
	{
		int			k;

		for (k = 0; k < ALLOCSET_NUM_FREELISTS; k++)
		{
			int			n = numoccurs[k];
			size_t		siz = (1 << (ALLOC_MINBITS + k));

			while (n-- > 0)
			{
				AllocSetFreeIndex(siz);
			}
		}
	}

	return 0;
}

#include <stdio.h>
#include <stdlib.h>

#define ALLOC_MINBITS 3			/* smallest chunk size is 8 bytes */
#define ALLOCSET_NUM_FREELISTS 11
#define BITS_PER_BYTE 8

/* ----------
 * AllocSetFreeIndex -
 *
 *		Depending on the size of an allocation compute which freechunk
 *		list of the alloc set it belongs to.  Caller must have verified
 *		that size <= ALLOC_CHUNK_LIMIT.
 * ----------
 */
int
AllocSetFreeIndex(size_t size)
{
	int			idx = 0;

	if (size > (1 << ALLOC_MINBITS))
	{
		size = (size - 1) >> ALLOC_MINBITS;

#ifdef HAVE_BUILTIN_CLZ
		idx = sizeof(unsigned int) * BITS_PER_BYTE -
			__builtin_clz((unsigned int) size);
#else
		do
		{
			idx++;
			size >>= 1;
		} while (size != 0);
#endif
	}

	return idx;
}